Research:External Reuse of Wikimedia Content/Background

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T235780

This page is an attempt to bring together and categorize what we know about external reuse of Wikimedia content -- alternatively referred to as syndication or simply as reuse in other places. It is based on prior attempts, personal research / digging, and discussions with interested colleagues.

What is external reuse? I am defining external reuse broadly to be any instance of Wikimedia content appearing on platforms outside of Wikimedia. Much of this is expected and encouraged -- e.g., a snippet of Wikipedia text alongside results in a search engine, an individual using an image from Commons on their blog. Anecdotally, however, external reuse is increasing in volume -- i.e. more people are consuming Wikimedia content outside of the Wikimedia ecosystem. External reuse is also becoming more complex as platforms translate content, seek to directly answer user queries, and operate through media such as voice assistants. This raises concerns about the completeness of metrics like pageviews, whether readers are aware that this content comes from Wikimedia, and the degree to which these readers become contributors.

Below I have developed a series of high-level categories of external reuse that seek to capture the commonalities in mechanism, availability of data, impact on Wikimedia projects, and, as a result, priority for further study. For each category, I've attempted to provide a summary, collect pertinent examples, and describe existing research and data regarding how this type of external reuse impacts Wikimedia projects (and the platforms that engage in reuse). Included are many examples of reuse (and many more were left out). Due in large part to the difficulty of obtaining data on actual user behavior on platforms like Web Search or Voice Assistants, however, there is very little research on most types of reuse despite their importance. Out of scope for this project is the reuse of Wikimedia data by machine learning models -- a very large area worthy of its own study -- and reuse of Wikimedia data across different projects -- e.g., transclusion of Wikidata in Wikipedia articles -- which is also quite important but a separate topic.

Categories

Large-scale Portals

Mirrors, portals, and offline access encompass Wikimedia content that can be viewed in full outside of Wikimedia. This ranges from laudable projects that aim to make Wikipedia accessible in areas without good internet access, to alternative interfaces for Wikipedia content that are arguably an improvement (though usually with the addition of advertisements), to malicious bulk copying of content without links or attribution (piracy). Of note, Kiwix and Internet-in-a-Box are special cases that are very important for offline access and generally quite different from many of the other unaffiliated mirrors / apps. This category does not include, for example, the use of Commons images on a blog, which is not en masse.

There is no strong evidence at this point to suggest that mirrors (outside of offline access) have managed to attract significant traffic. While mirrors certainly could pose challenges to awareness and the ability to attract editors, they likely have had little effect on page views or public opinion. In part to address concerns that some of the examples below could lead to confusion for readers, the External Guidance extension was developed to provide a funnel for bringing these readers more seamlessly into Wikipedia (specifically for translated content).

Examples

Offline Access

Kiwix is an offline web browser that allows individuals to access Wikipedia and other public resources, in particular in regions where online access is difficult or exorbitantly expensive. It has over 500,000 Android installs as of January 2020 and an unknown number on iOS. Other examples include Internet-in-a-Box and a few other applications developed by Wikimedia Switzerland.

Proxies

A number of proxies have been developed, as typified by Google, that either seek to speed up access to websites or translate them into a different language for readers. While these proxies may be well-intentioned, they can create confusion depending on how they handle entry points into editing.

Google Web Light is a proxy that attempts to support low-speed data connections. It is unclear if Web Light changes anything about a Wikipedia page, but it can affect page view analytics as the traffic appears to originate from Google's servers as opposed to individual devices. Internal data suggests that approximately 2.5% of page views across all projects are served via Web Light as of January 2020.

Google Translate provides machine-translated versions of articles to readers. Generally it requires that the reader request the translation but the extension can also detect and automatically translate articles based on user settings. While usage of Translate is not always detectable depending on the method used, this analysis of Google Translate for Indonesian Wikipedia indicates that at least 30 thousand page views per day come through user-initiated Google Translate. This analysis from March 2019 indicated that most translate requests have English Wikipedia as the source and this simple query suggests that almost half a million page views a day occur via Google Translate across all projects.

See this list of common browsers that implement proxies for more examples.

Mirrors

Wikiwand (and other unofficial apps and mirrors) provide alternative interfaces for viewing Wikimedia content. These interfaces can be stand-alone websites, browser extensions, or mobile apps that wrap Wikipedia content with a supposedly richer interface, but also generally add advertisements. Usage is unclear, though Wikiwand as a higher profile example has a couple hundred thousand downloads per the Chrome and Mozilla web app stores. There is also anecdotal evidence that other encyclopedias have copied content from Wikipedia without appropriate attribution.

Research

Beyond the miscellaneous statistics noted above and anecdotal evidence, little research has investigated the extent of these platforms. The most extensive accounting comes from Alshomary et al. 2019,[1] who estimate that nearly 50,000 websites reuse English Wikipedia content -- most without any form of attribution and with the addition of advertising -- and that these pages collectively bring in $5.5M USD in advertising revenue per month.

Linked Open Data

This category covers reuse that is akin to linked open data where data structures are linked to support transclusion of content on external sites and to allow data to flow back and forth between platforms. A good example internal to Wikimedia is the tight coupling between Wikidata and Wikipedia. These examples generally help make Wikimedia content more central and discoverable to the web while also providing a means of importing data into the projects. While they may not strongly affect traffic, they can still help support content creation and strengthen other open projects.

Examples

  • OpenStreetMap includes Wikidata identifiers and Wikipedia entries for places on their map. See their wiki for more information.
  • Schema.org properties such as sameAs, which allow web pages to be linked to concepts (often via Wikidata). For example, Wikipedia + sameAs or web pages that link to Wikipedia or Wikidata for fact-checking.
  • DBpedia ingests and converts Wikipedia's XML dumps into structured data. It then makes the structured data available via dumps and SPARQL endpoints. It is used by hundreds of Linked Open Data services and applications.
  • A number of GLAM institutions have explored stronger connections between their databases and Wikidata or Commons. See this report for more information.
  • iNaturalist links to Wikipedia articles in their "About" section for species (more info) and the iNaturalist taxon ID is included as structured data on Wikipedia articles. There has also been a fair bit of photos / information shared back to Wikimedia from the iNaturalist community (Commons example). When a species lacks a Wikipedia article, iNaturalist also suggests that users create the article and provides a starting template.
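
As a rough illustration of the Schema.org sameAs linking mentioned in the list above, the sketch below builds the kind of JSON-LD markup an external page might embed to point at the matching Wikidata item and Wikipedia article. The helper name `build_same_as_markup`, the generic `Thing` type, and the Douglas Adams / Q42 example are assumptions chosen for illustration; they are not taken from any of the sites listed here.

```python
import json


def build_same_as_markup(name: str, wikidata_id: str, wikipedia_title: str) -> str:
    """Return JSON-LD that ties a page's subject to Wikidata and English Wikipedia.

    Hypothetical helper: real sites choose their own Schema.org type and
    may link to other language editions.
    """
    markup = {
        "@context": "https://schema.org",
        "@type": "Thing",
        "name": name,
        # sameAs lists URLs that unambiguously identify the same concept.
        "sameAs": [
            f"https://www.wikidata.org/entity/{wikidata_id}",
            f"https://en.wikipedia.org/wiki/{wikipedia_title.replace(' ', '_')}",
        ],
    }
    return json.dumps(markup, indent=2)


print(build_same_as_markup("Douglas Adams", "Q42", "Douglas Adams"))
```

A page would typically place the resulting string inside a `<script type="application/ld+json">` element so that crawlers can read the link.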

This covers instances in which outside services provide a direct search into Wikimedia. This is different from Google Search etc. because it is only indexing Wikipedia -- i.e. the search either retrieves a Wikipedia article or no result -- and often has unclear referral information because it appears as a direct referral. While this generally is a positive use-case for Wikimedia reuse, there is very little information regarding the usage.

Examples

  • DuckDuckGo supports Wikipedia bangs (!w <search query>) that directly search Wikipedia.
  • Dictionary on Mac[2] does Wikipedia look-ups, though the searches pass through Wikimedia servers and are thus countable.
  • Mozilla Firefox has Wikipedia as an available search engine in the toolbar but not the default (Google is), so Wikipedia receives presumably very little traffic in this manner.
  • Amazon Kindle allows for searching terms in books directly on Wikipedia.
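
To make the DuckDuckGo bang mechanism in the list above concrete, here is a minimal sketch that builds the search URL a browser would request for a `!w` query. The function name `wikipedia_bang_url` is hypothetical, and the sketch assumes the standard `q` query parameter on duckduckgo.com; what DuckDuckGo does with the request afterwards (redirecting to Wikipedia's own search) happens on its side and is not modelled here.

```python
from urllib.parse import urlencode


def wikipedia_bang_url(query: str) -> str:
    """Build a DuckDuckGo URL that uses the !w bang to search Wikipedia.

    Hypothetical helper for illustration; assumes the `q` parameter.
    """
    # urlencode percent-encodes "!" and turns spaces into "+".
    return "https://duckduckgo.com/?" + urlencode({"q": f"!w {query}"})


print(wikipedia_bang_url("Ada Lovelace"))
```

Because the reader lands on Wikipedia through a redirect rather than a results page, this kind of traffic can be hard to distinguish from a direct visit, which matches the unclear referral information noted above.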

Snippets (Search)

These are examples where small segments of Wikimedia content are algorithmically evaluated against other sources and then surfaced on search platforms outside of Wikimedia projects, with attribution and links back to Wikimedia where required. These are generally in good faith, but the long-term impact on Wikimedia is unclear and the details vary greatly. Search is the mechanism by which the vast majority of readers reach Wikipedia.[3] Snippets range from very short ones -- e.g., a few words -- that attempt to directly address a query and do not provide links or attribution, to longer ones -- e.g., 30 or 40 words -- that provide a preview of the content with a clear link to read more.

Examples

  • Google surfaces Wikimedia content in a number of ways:
    • As a traditional, organic link in the search results
    • As an organic link but one that is an auto-translated version of a Wikipedia article from another language -- e.g., as with Project Toledo and Indonesian Wikipedia.
    • As a featured snippet in which Google provides a more direct answer to a query. Though a few years old, this review by ahrefs provides a good overview of featured snippets (and the associated data suggests that List Articles on Wikipedia are commonly featured in snippets).[4]
    • As knowledge panels that provide a rich search result with a snippet of text, often facts from the associated Wikipedia article or Wikidata, images (sometimes from Commons), and a link to read more.
      • The text in these knowledge panel results can also include translated snippets a la Project Toledo that then link to the original language version of the article.
    • Finally, in other Google products such as Maps or Earth, a "quick facts" section often links to Wikipedia if you query, for example, a city with a Wikipedia article.
  • DuckDuckGo surfaces Wikipedia articles both as organic links and knowledge panels in its search results. The main difference with Google is that DuckDuckGo tends to provide purely Wikipedia knowledge panels -- i.e. no additional frills / translations -- whereas Google more often remixes the content with other sources or translations.
  • Microsoft runs Bing, which also often provides organic search results and knowledge panels. The main difference is that Bing makes almost the entire Wikipedia article available from the organic link: readers can skip between sections without directly visiting Wikipedia.
  • There are many more important search engines around the world (e.g., Yandex in Russia, Naver in South Korea, Yahoo! in Japan) that all largely follow the same pattern. Naver, however, does not appear to link extensively to Wikipedia within its knowledge panels and instead relies on its own Doopedia.
  • Amazon:
  • Wolfram Alpha is a knowledge base that has its own UI and also supports many other technologies via APIs. Its search results often include Wikipedia data (as indicated by the Wikimedia Foundation being listed as a source in the buried footnotes), but this presumably drives zero traffic via the UI: as of March 2020, there is no direct linking between the facts presented and Wikipedia articles (just a generic link to wikipedia.org and a reference to the Wikimedia Foundation). For example, see these pages for Bambi and Mountain View.

Research

  • Per SparkToro,[5] over 50% of searches on Google now do not result in a click to an external site such as Wikipedia. This reflects a growing trend of people receiving answers directly from search engines, which raises concerns about the longevity of the sites that provide these answers.
  • From McMahon et al. 2017,[6] we see that Wikipedia is a major source for these rich search results and would be the destination of many more users if the rich search results did not exist. We also see that many people use search engines to reach sites like Wikipedia because they know that the search engine will surface the content they are looking for.
  • From Vincent et al. 2019,[7] we see further evidence of the value of Wikipedia to Google Search. Namely, across many types of queries, Wikipedia is the most prominent domain and therefore central to the ability of search engines to provide high-quality results to their users.
  • Per the Wall Street Journal,[8] Google makes several thousand changes to its search algorithms every year. This makes it difficult to tie changes in a site's traffic or brand to any specific algorithmic change in Search.
  • That being said, per Business 2 Community,[9] when Google removed links that appeared in featured snippets, this led to a small drop (1.7% median; 6.9% average) in traffic to Wikipedia articles that met this criterion.
  • See this analysis of the impact of external automatic translations of Wikipedia articles.
  • A research study on Wikipedia traffic on DuckDuckGo[10] found that Wikipedia is the most prevalent search result and that information modules linking to Wikipedia mediate a substantial proportion of clicks, but that removing the information module does not lead to a corresponding drop in clickthrough rate.
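
The featured-snippet figures above report a median drop (1.7%) well below the average drop (6.9%). One way to read such a gap is that a few articles lost far more traffic than the typical article did. The sketch below shows how that arithmetic can come about; the five per-article values are entirely hypothetical, picked only so that the summary statistics resemble the reported ones, and say nothing about the actual distribution in the cited analysis.

```python
from statistics import mean, median

# Hypothetical per-article traffic drops in percent (illustrative only,
# not data from the cited analysis).
hypothetical_drops = [0.5, 1.0, 1.7, 2.4, 28.9]


def summarize(drops):
    """Return (median, mean) of a list of percentage drops."""
    return median(drops), mean(drops)


mid, avg = summarize(hypothetical_drops)
print(f"median drop: {mid:.1f}%, average drop: {avg:.1f}%")
```

In this toy sample, four of the five articles lose less than 2.5% of their traffic, yet the single large drop pulls the average far above the median.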

Snippets (Voice)

Voice assistants share a lot of similarities with web search but the medium leads to substantial differences in data (no hits to Wikimedia servers so no sense of scale of encounters, trends, etc.), likely impact (branding more difficult via audio and no clickthrough to site really possible), and usage (different types of needs probably met through voice than traditional web search).

Examples

  • Apple does not run a traditional search engine but does have a very popular voice assistant, Siri. Though it is unclear how often Siri relies on Wikipedia for providing answers to queries, Wikipedia is one of several sources directly incorporated into the assistant's knowledge graph.[11]
  • Amazon runs Alexa, which is a voice assistant that uses Wikipedia extensively for providing answers. See these meeting notes for more information.

Research

  • Dambanemuya and Diakopoulos[12] conducted an audit of the Amazon Alexa voice assistant and found that Wikipedia was the most common attribution but most query results were unattributed.

Automatic Fact-Checking

These are instances where links back to Wikipedia are automatically inserted by platforms into their site to provide context about sources (e.g., BBC, RT) or problematic content like conspiracy theories. This form of reuse is similar to snippets but the context is very specific and generally Wikipedia is the only source considered. Currently this is done by YouTube and Facebook. Early evidence suggests that these links lead to very few clicks, but this form of reuse could potentially be an important use case for Wikipedia given the growing alarm over disinformation.

Examples

  • Facebook uses Wikipedia to provide context for publishers or articles.[13]
  • YouTube uses Wikipedia to provide context for conspiracy theories and publishers,[14] though usage has been documented to be somewhat haphazard.[15]

Research

Human-Generated Content

This category covers organic links to Wikipedia that are generated by users on external platforms that can help surface Wikimedia content to readers on the web. Examples include user posts that use Wikipedia content in-line as a reference (e.g., StackOverflow, Quora) or directly as a cool fact (e.g., Twitter, Reddit). Generally this sort of linkage is encouraged and while it is a small proportion of overall traffic to Wikimedia sites, it serves as an important alternative to search traffic.

Examples

  • Twitter currently does not support rich display of Wikipedia articles linked to in tweets (see T213505 for some context), which presumably limits engagement with Wikipedia content on Twitter. Other platforms (Facebook, WhatsApp, etc.) generate their own previews to varying degrees of success.
  • On Facebook, groups like this one exist to share Wikipedia links. Wikipedia links also are presumably shared naturally through newsfeed posts but there is no evidence of large traffic to Wikipedia as a result. Facebook does render Wikipedia images / previews.
  • On Reddit, the TIL (Today I Learned) community frequently includes Wikipedia links as posts that can generate substantial traffic to those pages.
  • On StackExchange, Quora, and other Q&A sites, users often reference Wikipedia pages in their responses, generally with links through to the content.
  • TikTok is a video-sharing platform that allows its users to include Wikipedia links that are pertinent to the videos they upload.[16] Usage is unclear at this juncture, but TikTok has a lot of users and Wikipedia is unique as an external link that can be included.
  • Various news organizations include links to Wikipedia to enrich content in their articles. For instance, Ars Technica links to Wikipedia in the "Hello World" text here.

Research

Vincent et al. 2018[17] showed that Wikipedia is a common reference on sites like StackOverflow and Reddit and that inclusion of Wikipedia links is a strong indicator of a high-quality post on these sites. They also found minimal evidence that external links to Wikipedia on these sites drive much traffic (or editing) to Wikipedia. The exception was low-quality articles, for which external links can bring substantive attention. Erickson et al. 2018[18] showed that 27% of images on Commons have been reused in commercial contexts outside of Wikimedia.

References

  1. Alshomary, Milad; Völske, Michael; Licht, Tristan; Wachsmuth, Henning; Stein, Benno; Hagen, Matthias; Potthast, Martin (7 April 2019). "Wikipedia Text Reuse: Within and Without" (PDF). European Conference on Information Retrieval. doi:10.1007/978-3-030-15712-8_49. Retrieved 18 February 2020. 
  2. Says, Erik (16 November 2009). "Access Wikipedia from Spotlight in Mac OS X". OS X Daily. 
  3. Andreescu, Dan; Gordon, Kinneret; Johnson, Isaac; Perry, Nicholas. "Searching for Wikipedia". Wikimedia Tech Blog. Retrieved 15 October 2021. 
  4. Soulo, Tim (30 May 2017). "Ahrefs' Study Of 2 Million Featured Snippets: 10 Important Takeaways". SEO Blog by Ahrefs. Retrieved 24 March 2020. 
  5. Fishkin, Rand (13 August 2019). "Less than Half of Google Searches Now Result in a Click". SparkToro. Retrieved 2 March 2020. 
  6. McMahon, Connor; Johnson, Isaac; Hecht, Brent (2017). "The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies". International AAAI Conference on Web and Social Media (ICWSM). Retrieved 2 March 2020. 
  7. Vincent, Nicholas; Johnson, Isaac; Sheehan, Patrick; Hecht, Brent (2019). "Measuring the Importance of User-Generated Content to Search Engines". Proceedings of the International AAAI Conference on Web and Social Media 13: 505–516. ISSN 2334-0770. Retrieved 2 March 2020. 
  8. Grind, Kirsten; Schechner, Sam; McMillan, Robert; West, John (15 November 2019). "How Google Interferes With Its Search Algorithms and Changes Your Results". Wall Street Journal. Retrieved 2 March 2020. 
  9. Gleason, Derek (3 February 2020). "No More “Double Dipping” on Featured Snippets—Does It Matter?". Business 2 Community. Retrieved 2 March 2020. 
  10. Johnson, Isaac; Perry, Nicholas; Gordon, Kinneret; Katz, Jon (23 September 2021). "Searching for Wikipedia: DuckDuckGo and the Wikimedia Foundation share new research on how people use search engines to get to Wikipedia". Diff. Retrieved 15 October 2021. 
  11. Lardinois, Frederic (10 June 2013). "Apple Updates Siri With Twitter, Wikipedia, Bing Integration, New Commands And Male And Female Voices". TechCrunch. Retrieved 30 March 2020. 
  12. Dambanemuya, Henry Kudzanai; Diakopoulos, Nicholas (22 April 2021). "Auditing the Information Quality of News-Related Queries on the Alexa Voice Assistant" (PDF). Proceedings of the ACM on Human-Computer Interaction 5 (CSCW1): 83:1–83:21. doi:10.1145/3449157. Retrieved 15 October 2021. 
  13. Hughes, Taylor; Smith, Jeff; Leavitt, Alex (3 April 2018). "Helping People Better Assess the Stories They See in News Feed with the Context Button". Facebook. Retrieved 20 February 2020. 
  14. Matsakis, Louise (13 March 2018). "YouTube Will Link Directly to Wikipedia to Fight Conspiracy Theories". Wired. Retrieved 20 February 2020. 
  15. Kofman, Ava (22 November 2019). "YouTube Promised to Label State-Sponsored Videos But Doesn’t Always Do So". ProPublica. Retrieved 20 February 2020. 
  16. Ali, Agha. "Curious what TikTok app is adding in its new update? Let’s find out!". Digital Information World. Retrieved 20 February 2020. 
  17. Vincent, Nicholas; Johnson, Isaac; Hecht, Brent (April 2018). "Examining Wikipedia With a Broader Lens" (PDF). CHI '18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3173574.3174140. Retrieved 20 February 2020. 
  18. Erickson, Kristofer; Perez, Felix Rodriguez; Perez, Jesus Rodriguez (August 2018). "What is the Commons Worth? Estimating the Value of Wikimedia Imagery by Observing Downstream Use" (PDF). doi:10.1145/3233391.3233533. Retrieved 20 February 2020.