Wikimedia Foundation Annual Plan/2017-2018/Final/Structured Data

From Meta, a Wikimedia project coordination wiki

Structured Data on Commons

Teams: Reading, Community Programs, Technical Collaboration, Research, Wikidata (WMDE), MediaWiki Core, Cloud Services, Discovery/Search, Tech Ops, Performance

Strategic priorities:

  • Knowledge -- the project simplifies workflows and contribution venues for ingesting multimedia and its associated data. Currently, projects lose some of the most valuable information about the media; this project makes it possible to retain that information.
  • Communities -- the project is a direct response to long-standing requests from Communities for features like multilingual categories and improved search, which will bolster a number of community workflows.
  • Reach -- the project focuses on building the infrastructure needed for improving search and reuse of Commons.

Timeframe: 2.5+ years

Description | FY17-18 Plan: Structured Data | Description of Structured Data Expenses
Staffing Expenses | 1,365 | 10.23 FTE in the Technology, Product departments
Non Staffing Expenses | - |
Data Center Expenses | 16 | 4 test boxes for Multi-Content Revisions schema changes
Grants | - |
Endowment Contribution | - |
Donation Processing Fees | - |
Outside Contract Services | - |
Legal Fees | - |
Travel & Conferences | 35 | Travel to community conferences and other team travel
Other expenses (Wikidata, Personal Property Taxes) | 259 | Wikidata staffing expenses from WMDE personnel in support of the Structured Data program; additional payroll fees and personnel related expenses not captured in "Staffing Expenses"
Total Program Expenses | 1,675 |


We are integrating Wikimedia Commons with Wikibase (the technical underpinning of Wikidata) to allow for the use of structured data and the integration of metadata from other sources of content. This is in part a response to a number of requests from the Wikimedia Community, including, but not limited to, better search for Commons media, multilingual descriptions, and categorizations. This affects Reach, Knowledge and Community through developing a deeper relationship with the Wikimedia Commons community, the broader GLAM network, and other reusers of Wikimedia projects -- making it easier to integrate and reuse content, and reach a broader community of learners and readers with educational media.

Note: For more detailed information, please refer to Commons:Structured data/Sloan Grant, which gives background on the grant secured to bolster this work, and to this slide deck.


Unlock the potential of Commons by making it easier for people to discover, learn, and manage the free media stored on Commons and thereby incentivize higher contribution rates.

  • Short-term outcomes (end of FY 2017-18): Enable core infrastructure for integrating Structured Data into Commons, including Multi-Content Revisions (MCR), and Wikibase federation. Complete initial design research for major stakeholders in Commons, to allow planning for features development and other Commons improvements.
  • Medium-term outcomes (end of the Sloan Grant on March 1, 2020): Provide the infrastructure and tools that empower the Wikimedia Commons and GLAM communities to integrate Structured Data into at least 5 million media items on Commons.
  • Long-term outcomes: The Wikimedia Commons community and partners are able to provide robust structured data on the majority of Wikimedia Commons content. Data is leveraged for improved search, discovery and management of media in Commons for Commons contributors, individual visitors, and partners.

Segment 1: Database Integration

Lead Team: MediaWiki Core and WMDE


  • Outcome 1: It is possible to store structured data within wiki pages, in particular on media file pages on Commons. We will enable the MediaWiki storage layer to correctly store and process structured data elements within wiki pages.
    • This will make it possible for editors using a variety of input interfaces to use a standard data entry format for adding content about media. Today, entry of structured data about files is challenging and results in inconsistencies and errors in data entry, metadata retrieval, and user experience.
    • Q1-Q4
  • Outcome 2: Introduce Multi-Content Revisions (MCR), which is a prerequisite for enabling flexible input, display, consumption, and re-use of mixed media stored on wiki. Currently, extremely complicated techniques are required to achieve many of these ends, and some useful functions are impractical.
    • Q1-Q4


  • Objective 1: Extend the MediaWiki storage layer for first-class support of content metadata. Update MediaWiki application code backend to ensure revision retrieval, diff, and page update internals conform to this extended database layer.
  • Objective 2: Enable saving components to use the new backend. Update transaction management facilities to ensure changes to one or more types of content in a page are committed safely in the database and related systems.
  • Objective 3: Update page rendering, diff views, and both browser-based and API-based content retrieval and editing to use the new internals. Additionally, upgrade code of extensions used by Wikimedia and potentially highly popular third-party extensions, which may otherwise rely on outdated assumptions about data layer access and internals.
  • Objective 4: Enable Wikibase Federation, which supports the use of entities (Items and Properties) defined on one Wikibase repository (e.g., Wikidata) on another Wikibase repository (e.g., Wikimedia Commons).
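
The multi-content idea behind Objectives 1 and 2 can be pictured with a minimal sketch. It is not MediaWiki's actual implementation. It assumes that a revision holds several named content "slots" and that a save either commits every slot change or none. The class names (`Revision`, `Page`), the slot names `main` and `mediainfo`, and the JSON validation rule are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import json


@dataclass(frozen=True)
class Revision:
    """One saved state of a page, holding several named content slots."""
    rev_id: int
    slots: Dict[str, str]  # slot name -> serialized content


@dataclass
class Page:
    title: str
    history: List[Revision] = field(default_factory=list)

    def latest_slots(self) -> Dict[str, str]:
        """Return a copy of the slots of the newest revision (empty if none)."""
        return dict(self.history[-1].slots) if self.history else {}

    def save(self, changes: Dict[str, str]) -> Revision:
        """Apply changes to one or more slots as a single new revision.

        Every change is validated before anything is stored, so a bad slot
        leaves the page history untouched.
        """
        for name, content in changes.items():
            if name == "mediainfo":  # hypothetical structured-data slot
                json.loads(content)  # raises ValueError on malformed JSON
        slots = self.latest_slots()
        slots.update(changes)  # untouched slots carry over unchanged
        rev = Revision(rev_id=len(self.history) + 1, slots=slots)
        self.history.append(rev)
        return rev


page = Page("File:Example.jpg")
page.save({"main": "== Summary ==", "mediainfo": '{"statements": []}'})
try:
    # One valid slot and one malformed slot: the whole edit is rejected.
    page.save({"main": "edited text", "mediainfo": "{not json"})
except ValueError:
    print("edit rejected; revisions stored:", len(page.history))
```

In this sketch an edit that touches only the `main` slot still produces a revision that carries the previous `mediainfo` content forward, which is the property that lets existing wikitext workflows continue unchanged.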


  • Milestone 1: The extended database layer and MediaWiki application internals are deployed to production - with advance notice so that tool maintainers with replicas of the databases are prepared - and seamlessly support existing content consumption and editing workflows. In addition to robust test suites demonstrating correctness, observation of user traffic and feedback should reveal no adverse impacts if this is on track.
  • Milestone 2: True multi-content edits can be successfully saved, accessed, exported, and diffed reliably. In addition to robust test suites demonstrating correctness, observation of early adopter wiki user and bot operator feedback should reveal what’s working and what needs improvement.
  • Milestone 3: Reading and editing user-perceived speeds are not negatively impacted, nor are server resources dramatically exhausted. This will be observable through existing performance instrumentation.
  • Milestone 4: Wikimedia Commons becomes a Wikibase repository, storing Wikibase entities that describe media files. It describes media files using properties and concepts (items) defined on Wikidata, e.g., “license: CC-BY-SA-4.0” or “person-shown: Walt Disney”.
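
Milestone 4 can be pictured with a small sketch. It assumes two toy in-memory repositories: the Commons-side entity stores only identifiers, and their labels are resolved against the Wikidata-side repository. The identifiers (`P-license`, `Q-cc-by-sa-4`, `M1`, and so on) and all class and method names are invented for illustration; they are not real Wikidata IDs or Wikibase APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Repository:
    """A toy entity store: entity ID -> labels keyed by language code."""
    name: str
    labels: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def label(self, entity_id: str, lang: str) -> str:
        by_lang = self.labels[entity_id]
        # Fall back to English when the requested language is missing.
        return by_lang.get(lang, by_lang["en"])


@dataclass
class MediaEntity:
    """A Commons-side entity describing one media file."""
    entity_id: str
    file_title: str
    # Each statement is (property ID, item ID), both defined on the remote repository.
    statements: List[Tuple[str, str]] = field(default_factory=list)

    def describe(self, remote: Repository, lang: str) -> List[str]:
        """Render statements as 'property: value' using the remote repository's labels."""
        return [f"{remote.label(p, lang)}: {remote.label(v, lang)}"
                for p, v in self.statements]


# Hypothetical entities standing in for Properties and Items on Wikidata.
wikidata = Repository("wikidata", {
    "P-license": {"en": "license", "de": "Lizenz"},
    "P-person-shown": {"en": "person-shown"},
    "Q-cc-by-sa-4": {"en": "CC-BY-SA-4.0"},
    "Q-walt-disney": {"en": "Walt Disney"},
})

photo = MediaEntity("M1", "File:Example.jpg", [
    ("P-license", "Q-cc-by-sa-4"),
    ("P-person-shown", "Q-walt-disney"),
])

print(photo.describe(wikidata, "en"))
# ['license: CC-BY-SA-4.0', 'person-shown: Walt Disney']
```

Because the file's entity holds identifiers rather than text, the same statements can be displayed in any language for which the remote repository has labels.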

Segment 2: Search integration and exposure

Lead Team: Discovery/Search


  • Outcome 1: Readers, editors, and content re-users can find media using precise queries. This rectifies the current situation, which often requires tacit knowledge about categories and further exhaustive combing through files.
    • Q2-Q3
  • Outcome 2: Readers, editors, and content re-users can more easily find media in a language of their choosing. Presently, users who know the name of a subject only in their primary language may need a translation service before they can search, or worse, may be unable to find media at all even when it is available.
    • Q2-Q3


  • Objective 1: Commons search will be extended via CirrusSearch, Elasticsearch, and the Wikidata Query Service to support searching based on structured data elements describing media.
  • Objective 2: Advanced search capabilities (e.g., Wikidata Query Service, SPARQL queries) will be updated to support the more specific media search filters and the relationships to the topics they represent.
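
The kind of query these objectives aim to support can be sketched as follows. This is not CirrusSearch, Elasticsearch, or SPARQL; it is a plain in-memory filter that only illustrates searching by structured criteria and by a label in the user's own language. The field names (`topic_labels`, `license`, `width`), the function `search`, and the sample records are all invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MediaRecord:
    title: str
    topic_labels: Dict[str, str]  # language code -> label of the depicted topic
    license: str
    width: int  # image width in pixels


def search(records: List[MediaRecord], topic: str, lang: str,
           required_license: Optional[str] = None,
           min_width: int = 0) -> List[str]:
    """Return titles whose topic label in `lang` matches, narrowed by optional filters."""
    wanted = topic.casefold()
    hits = []
    for record in records:
        if record.topic_labels.get(lang, "").casefold() != wanted:
            continue  # topic does not match in the requested language
        if required_license is not None and record.license != required_license:
            continue  # wrong license
        if record.width < min_width:
            continue  # resolution too low
        hits.append(record.title)
    return hits


records = [
    MediaRecord("File:A.jpg", {"en": "cat", "es": "gato"}, "CC-BY-SA-4.0", 4000),
    MediaRecord("File:B.jpg", {"en": "cat", "es": "gato"}, "CC0", 800),
    MediaRecord("File:C.jpg", {"en": "dog", "es": "perro"}, "CC-BY-SA-4.0", 3000),
]

# A Spanish-language query with a minimum resolution.
print(search(records, "gato", "es", min_width=1000))  # ['File:A.jpg']
```

The same records answer an English query for "cat" without any translation step, because the topic is stored once with labels in several languages.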


  • Milestone 1: Commons community members confirm the most important set of search criteria to be readily available from within web search.
  • Milestone 2: Users are observed as being more satisfied when conducting Commons searches by search criteria such as topic, rights holder, license type, or media quality (e.g., image resolution), as well as by top search filters recommended by Commons community members. Additionally, media search, as part of placement in articles from within integrated experiences such as Visual Editor, becomes capable of greater sophistication.

Segment 3: Data input and migration

Lead Team: Reading:Multimedia


  • Outcome 1:
    • Commons contributors, partners contributing media, individual uploaders, and others interested in classifying structured data about media will enjoy a more seamless, predictable, and bug-free user experience. Currently, the upload and media classification experience (Upload Wizard, File pages) is limited by a lack of ability to easily input, surface and utilize metadata extensively and reliably. This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.
  • Outcome 2:
    • Millions of media files have structured data attached to them, empowering better Commons search and a more useful, consistent display for users consuming media.
    • This is a multi-year effort and will not be completed in FY 2017-18. It is included in this document for clarity.


  • Objective 1:
    • Upgrade wiki upload workflows to support data entry and importation of configurable structured data, using Multi-Content Revisions-aware APIs and Wikibase on Commons.
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.
  • Objective 2:
    • Upgrade Media Viewer and File pages to support display of structured data -- in particular, licensing information -- more consistently through use of multi-content revisions-aware APIs and Wikibase on Commons.
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.
  • Objective 3:
    • Provide technical support and guidance to community and partner tool builders who want to support contributory or re-use workflows. This will take the form of documentation and electronic forum discussion (probably primarily on-wiki).
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.


  • Milestone 1: Synthesize existing user research and Commons feedback and preview user experience enhancement prototypes. We will know we are on track based on positive community engagement and user research results.
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.
  • Milestone 2: Prototypes are built and reviewed with core community stakeholders. We will know we are on track based on feedback about the enhanced functionality from early adopters, including field users observed through user research.
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.
  • Milestone 3: At least two major tool developers or re-users begin software updates or write new tools to support data entry of structured data or ingestion of structured data for Commons media into their user experiences.
    • This is a multi-year effort and will be worked on but not completed in FY 2017-18. It is included in this document for clarity.

Segment 4: Programs

Lead Team: Programs


  • Outcome 1: As part of a broader effort to connect GLAM data across institutions, we will develop relationships with GLAM allies who will be stakeholders and reusers of Structured Commons and Wikidata. This will allow for the technical teams to test ideas and practical needs with institutions as they develop new features.
    • Q1-Q4
  • Outcome 2: We will develop better understanding of existing needs for Structured Commons, including, but not limited to, media uploads, Wikidata usage for various programmatic applications, and the broader partnership applications of Wikidata beyond the Wikimedia ecosystem. Better case studies, documentation and support in collaboration with WMDE will allow for broader long-term impact.
    • Q1-Q4


  • Objective 1: Attend movement and GLAM network events that allow us to identify needs and priorities for these stakeholder groups. This will be done in collaboration with design research.
  • Objective 2: Write case studies and documentation for Commons and Wikidata projects that support project development among Wikimedia Communities and allow us to identify gaps in existing tools.


  • Milestone 1: Develop a group of representatives for at least 10 major GLAM institutions or GLAM networks (DPLA, Europeana, ICOM, IFLA, etc.) ready to test, and provide feedback and advice on structured data on Commons. With these partners and in cooperation with regional or local GLAM volunteers or coordinators, develop strategies for supporting long-term transition to structured data for Wikimedia Content.
  • Milestone 2: Develop at least two workshops or sets of training materials that can be used to upskill existing community members and GLAM partners in using structured data from the Wikimedia Community.

Segment 5: Community Liaison

Lead Team: Technical Collaboration


  • Outcome 1: The Wikimedia communities, GLAM partners, and developers are fully on board with the Structured Data project. They participate in the different stages of planning and development, and adopt the new features.


  • Objective 1: Work with the Structured Data team to plan and implement community collaboration activities from project inception to development and roll-out of new features.
  • Objective 2: Increase awareness among contributors of the ongoing work, how they can participate effectively, and how they can reuse the outcomes for their own projects; this includes partners like Galleries, Libraries, Archives and Museums ("GLAM") and developers.
  • Objective 3: Assist community leaders in adopting and spreading the use of new processes and tools into the wider community.
  • Objective 4: Monitor Commons and Wikidata for emerging issues with the software, with the assistance of volunteers.


  • Milestone 1: Appropriate documentation for the community to maintain transparency about the work, including:
    • A high-level project description and roadmap, maintained together with the product managers, which reflects progress and upcoming milestones, and points technical and non-technical volunteers to goals or tasks where feedback and contributions are welcome.
    • A stable stream of selected updates about the project, maintained together with the product managers and targeted mainly to the Commons and Wikidata communities, GLAM partners, and developers. The stream is updated at least once a month and includes all consultations, other calls to action, and project decisions.
    • Instructions for volunteer testers to get involved and provide useful feedback on-wiki or in Phabricator.
    • Newcomer-proof user help documentation, either new or updating existing docs, covering new features as they enter the beta and stable stages, with illustrative screenshots and screencasts.
    • All content should be concise, use plain English, and be translatable.
  • Milestone 2:
    • Presentations and workshops about the Structured Data project are submitted to the main Wikimedia events and are made available online.
    • Very active contributors and other community leaders receive personal invitations to learn about new features, receive support if needed, and are encouraged to share their feedback publicly or to the team.
  • Milestone 3:
    • Relevant technical or social issues reported by the communities in project pages or their main venues are submitted to Phabricator and/or escalated to the product managers.
    • Zero unanticipated clashes with the communities. Community Liaisons (CL) will ensure that development teams are aware of potential and emerging points of conflict.