Content reuse

To date, Wikimedia content has mostly been optimized to be understood by humans, not machines. This is not surprising, given our history and status as the world's largest collaboratively generated store of knowledge. However, the way in which knowledge is accessed is changing. Even as global access to the internet continues to grow, direct readership of the Wikimedia websites has remained effectively static since 2013, while reliance on Wikimedia content by large reuser organisations increases.

Background

Right now, it is difficult for anyone except the largest companies to easily reuse Wikimedia content. Programmatic access to our full corpus is available to anyone, and Wikidata has shown how valuable open knowledge in a structured data format can be. Nonetheless, we do not currently provide a consistent set of tools or methods for accessing comprehensive, well-structured, machine-readable versions of our content across all our projects. As a result, it takes significant time and resources to build and maintain access to our data for reuse in other environments. As a direct consequence, only a relatively small set of well-funded organizations are able to make the considerable up-front and ongoing resource commitment required to collect our data and put it to use in their own systems. This reinforces monopolies and has several negative impacts: it undermines our movement's long-term sustainability, restricts the diversity of potential reuse, and results in inconsistency in the way that knowledge is shared (including inconsistent attribution).

In the words of the Wikimedia strategic direction: there is no knowledge equity without knowledge as a service.

What is content reuse?

Wikimedia content -> Access -> Ingestion -> Extraction -> Integration -> Distribution -> Attribution -> END USER


Not all stages apply to all reusers, but all reusers will utilise some stages of this life cycle. Researchers will likely follow the first three or four stages: access through to integration. This would then be followed by further processing, analysis, reporting and/or visualisation. Reusers with less mature setups might skip the integration with a larger ontology. The largest users with the most mature and complex processes will utilise all steps in this process.

Access to content by reusers is achieved through three broad means: Scraping, Dumps, and APIs. Access is managed through rate limiting, access keys, and other authentication methods.
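
As an illustration of the API route only, the sketch below fetches a single article summary over the public Wikimedia REST API. It is a minimal example, not a recommended client: the bot name and contact address in the User-Agent header are placeholders, and dumps and scraping are not shown.

```python
# Minimal sketch of the "Access" stage via the public Wikimedia REST API.
# The summary endpoint and the User-Agent convention are real; the example
# title and the contact address are placeholders.
import requests

HEADERS = {
    # Wikimedia asks API clients to identify themselves.
    "User-Agent": "ExampleReuserBot/0.1 (contact: ops@example.org)"
}

def fetch_summary(title: str, lang: str = "en") -> dict:
    """Fetch the plain-text summary of one article over the REST API."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # rate limiting surfaces here as HTTP 429
    return resp.json()

if __name__ == "__main__":
    summary = fetch_summary("Ada_Lovelace")
    print(summary["extract"])
```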

Transmission or ingestion of content by reusers can be broken down into two main methodologies:

  • Pull - Can be manual or automatic; for example, pulling data from an API or downloading regular dumps.
  • Push - Currently the only push offering provided is the EventStreams platform (see the sketch below).
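
A minimal sketch of the push route follows, reading the public EventStreams recent-change feed over Server-Sent Events. The filter on English Wikipedia and the event limit are illustrative choices, not part of the platform.

```python
# Minimal sketch of push-based ingestion: consuming the EventStreams
# recent-change feed over Server-Sent Events with plain requests.
# The stream URL is the public endpoint; filtering to "enwiki" and
# stopping after a few events are illustrative choices only.
import json
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def follow_recent_changes(limit: int = 5) -> None:
    """Print a handful of recent-change events pushed by the platform."""
    with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
        seen = 0
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue  # skip comments, event ids and keep-alives
            event = json.loads(line[len(b"data: "):])
            if event.get("wiki") != "enwiki":
                continue
            print(event["title"], event["type"])
            seen += 1
            if seen >= limit:
                break

if __name__ == "__main__":
    follow_recent_changes()
```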

Extraction and transformation of content occur when HTML, Wikitext and JSON exports are parsed to create components derived from the infobox, lede, article body, and references. These derived components are then stored internally within knowledge warehouses.
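
The sketch below illustrates this extraction step on a toy piece of wikitext, using the open-source mwparserfromhell library to split it into infobox fields and plain body text. Real pipelines parse full exports and handle many more component types; the example article and field names are placeholders.

```python
# Minimal sketch of the extraction stage: parsing raw wikitext into
# derived components (infobox fields and plain-text body) with
# mwparserfromhell. The input wikitext is a toy example.
import mwparserfromhell

WIKITEXT = """{{Infobox person
| name = Ada Lovelace
| birth_date = 1815
}}
'''Ada Lovelace''' was an English mathematician and writer.
"""

def extract_components(wikitext: str) -> dict:
    """Split wikitext into infobox key/value pairs and body text."""
    code = mwparserfromhell.parse(wikitext)
    infobox = {}
    for template in code.filter_templates():
        if str(template.name).strip().lower().startswith("infobox"):
            for param in template.params:
                infobox[str(param.name).strip()] = str(param.value).strip()
    # strip_code() drops templates and markup, leaving the readable body
    body = code.strip_code().strip()
    return {"infobox": infobox, "body": body}

if __name__ == "__main__":
    print(extract_components(WIKITEXT))
```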

Content integration involves linking derived components into a broader ontology that spans multiple data sources, frequently referred to as a "knowledge graph". This data structure is key to reusers' ability to provide additional context from related topics or concepts.
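
As a rough illustration, the sketch below loads one extracted component into a small RDF graph with the rdflib library. The example.org namespace and the property names are placeholders rather than a real reuser schema; in practice, integration spans many sources and far richer ontologies.

```python
# Minimal sketch of the integration stage: linking a derived component
# into a toy knowledge graph as RDF triples with rdflib. The example.org
# namespace and property names are placeholders, not a real schema.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ontology/")

def integrate(component: dict, graph: Graph) -> Graph:
    """Add one extracted article component to a toy knowledge graph."""
    subject = URIRef(EX[component["title"].replace(" ", "_")])
    graph.add((subject, EX.summary, Literal(component["body"])))
    for key, value in component["infobox"].items():
        graph.add((subject, EX[key], Literal(value)))
    # Link to an external identifier so other data sources can join on it.
    graph.add((subject, EX.sameAs, URIRef("http://www.wikidata.org/entity/Q7259")))
    return graph

if __name__ == "__main__":
    g = integrate(
        {"title": "Ada Lovelace",
         "body": "English mathematician.",
         "infobox": {"birth_date": "1815"}},
        Graph(),
    )
    print(g.serialize(format="turtle"))
```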

Distribution of content is achieved when derived components or integrated ontologies are utilised by products. For example, in the case of digital assistants: queries from users are interpreted via natural language processing to identify a particular topic or subject matter. The ontology or knowledge graph is polled for information and the response is relayed back to the user.
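
The sketch below imitates the digital-assistant case at a very small scale: it polls a public knowledge graph (here the Wikidata Query Service) for a label and description and relays a one-line answer. The mapping from a user's question to a Wikidata item is hard-coded for the example; a real assistant would derive it via natural language processing.

```python
# Minimal sketch of the distribution stage: an assistant-style lookup
# that polls a knowledge graph (the public Wikidata Query Service) and
# relays a short answer. The query and the hard-coded question-to-entity
# mapping are illustrative only; the contact address is a placeholder.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "ExampleAssistant/0.1 (contact: ops@example.org)"}

def answer(entity_id: str) -> str:
    """Return the English label and description of an entity as a short answer."""
    query = f"""
    SELECT ?label ?description WHERE {{
      wd:{entity_id} rdfs:label ?label ;
                     schema:description ?description .
      FILTER(LANG(?label) = "en" && LANG(?description) = "en")
    }}
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    row = resp.json()["results"]["bindings"][0]
    return f'{row["label"]["value"]} is {row["description"]["value"]}.'

if __name__ == "__main__":
    # "Who is Ada Lovelace?" would map to Wikidata item Q7259.
    print(answer("Q7259"))
```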

Attribution to the original source of the data is appended (when required by the license, or, where not required, simply to clearly identify provenance). This can be a written attribution on websites or an audio attribution in the case of digital assistants.
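
As a simple illustration, the hypothetical helper below builds a written or spoken attribution line for content derived from a Wikipedia article. The exact wording is not a required legal formula; reusers should follow the terms of the relevant license.

```python
# Minimal sketch of the attribution stage: building a written or spoken
# credit line for content derived from a Wikipedia article. The helper
# and its wording are hypothetical, not a prescribed attribution format.
def attribution(title: str, lang: str = "en", spoken: bool = False) -> str:
    """Build a credit line pointing back to the source article."""
    url = f"https://{lang}.wikipedia.org/wiki/{title.replace(' ', '_')}"
    if spoken:
        # Digital assistants typically read a short source statement aloud.
        return "According to Wikipedia, licensed under CC BY-SA."
    return f'Source: "{title}" on Wikipedia ({url}), licensed under CC BY-SA.'

if __name__ == "__main__":
    print(attribution("Ada Lovelace"))
    print(attribution("Ada Lovelace", spoken=True))
```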

What are the benefits of reuse to the Wikimedia movement?

Reuse enables greater reach of free knowledge and movement philosophy - Our mission is to enable everyone to share in the sum of all knowledge, no matter where they are and no matter what device or medium they use. Reuse acts as a force multiplier for achieving that mission. While people do not need to know who we are to benefit from the work of our volunteer communities and our movement's values, our movement is more sustainable if they do.

We also can't achieve our mission by ourselves. We reach at most 15% of the world's population, and in terms of volunteers, staff and resources we are only a mid-sized organisation. If we are serious about our mission we must have help, and we are not the only people working towards improving access to knowledge. Syndication and reuse of the movement's data help increase its spread beyond our current audience. Reusers act as partners, furthering our success as we try to fulfill that mission.

Greater reach leads to greater discovery of Wikimedia - Reuse, when done well and properly attributed, can enable discovery of the Wikimedia projects. The majority of people who learn about Wikipedia do so via the internet, from classroom settings, or from friends and family. In the first two instances this frequently happens when Wikimedia content is used to provide additional context rather than simply linked, and is properly attributed. Proper attribution of Wikipedia content by third parties, especially our largest reusers, is not just about compliance with licensing; it is a fundamental driver of awareness of the Wikimedia projects.


Our reach enables us to provide context in 3rd party settings - People are always looking for answers to their questions, and this manifests itself online more and more each year. People interact with Wikipedia in a variety of settings, and our content enables them to readily access facts, ideas, and concepts from within the settings they are already in. This context can be provided by humans or machines. Human provision of context happens at an individual level, person to person or shared within a network of individuals: in social media, real-time messaging platforms, classroom learning, or links from websites and articles. Alternatively, context can be provided programmatically, with a greater degree of autonomy and at a greater scale; this typically involves integration with other products, and search engines, social media platforms, and various other services have all started integrating Wikipedia into their settings in this way.

Counters disinformation - In its early days Wikipedia was derided for a lack of accuracy, but that perception has broadly changed. The breadth and depth of Wikipedia, along with its accuracy and reduced bias within certain topics, has been the focus of scientific studies and media coverage over the years. The communities' dedication to the integrity of our content means that it is now used as a tool against disinformation elsewhere on the web.

What problems exist with reuse of Wikimedia content?

The barrier to entry for ingestion & transformation of our corpus is considerable - A few companies can do this; most cannot. The few organizations that reuse our content have to put significant ongoing resources into reusing and integrating it, which creates significant burdens even on those who can afford to do so. This limits the potential for reuse of our content, and it limits our usefulness and longevity as a project. When reusers of Wikimedia data at scale want to access our corpus of information right now, they have to develop their own software to do so. We make it challenging to serve most of the use cases where reusers need to accurately and consistently resurface our content. Whilst we are theoretically an open and open-access project, in practice we fall short.

That barrier to entry is potentially reinforcing bias in knowledge access and limiting access to our other projects. Conversations with full-corpus reusers suggest that, due to the technical barrier to using our content, they typically utilise only the minimum number of languages or projects. This limits the discoverability of sister projects and smaller languages. The byproduct is that we are inadvertently creating a self-reinforcing bias in public awareness, use, and contribution towards only the larger language editions and sister projects.

We have very little insight into how our data is reused. As there is no consistency in how reusers access our content, we have little insight into how that content is reused, nor any way to ensure reuse requirements are addressed. These requirements include the ability for individuals to flag misinformation, to provide or suggest content additions and improvements, and adherence to content licenses. The lack of a consistent platform for reuse makes it harder to understand who our reusers are, how we can better support them, and how to build a deeper relationship with them.

3rd party API usage impacts our communities and impedes their work. Currently, rate limits on API usage are global: a global tech corporation uses the same infrastructure, and shares the same bandwidth, as a volunteer bot developer. In principle this sounds egalitarian, but in reality a handful of large corporate users are responsible for extensive API usage. The largest users call our APIs tens of millions of times per day, which frequently disrupts bot users, particularly those using the Wikidata Query Service as part of their functionality, and results in community tools being slowed down by global rate limits, reducing their effectiveness and usability.

We don't yet know how to ensure the recruitment of editors via 3rd parties and maintain healthy communities. The biggest challenge we need to solve in the long run is how we recruit editors via 3rd parties. Without this piece, the long-term health and sustainability of the community is more uncertain.

See Also