Research:Newsletter/2018/February

Wikimedia Research Newsletter

Vol: 8 • Issue: 02 • February 2018

Politically diverse editors write better articles; Reddit and Stack Overflow benefit from Wikipedia but don't give back


With contributions by: Barbara Page, FULBERT, Steve Jankowski and Tilman Bayer

Politically diverse editors and article quality

Controversy is an organized sport for some editors but may alert readers that there is more than one view on a topic.
"The Wisdom of Polarized Crowds"[1]
Reviewed by FULBERT

While politics in the United States appears to be increasingly polarized around extremes in political discourse, it was unclear how this affected the open, collective production of knowledge that is Wikipedia.

The researchers used a data dump of English Wikipedia from 12/1/16, including all edits made since Wikipedia's start within the domains of politics, social issues, and science. They used the "American liberalism" and "American conservatism" categories and their sub-categories as delimiters, with breakdowns in social issues and science down four levels from the root. They also surveyed the Wikipedia community, Wikimedia staff, and those who inquired directly on the project page they created on Meta-Wiki, receiving 118 responses overall. The researchers then analyzed user edits to determine each editor's political alignment based on contributions to conservative or liberal articles.
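As a rough illustration of inferring alignment from contributions, the sketch below scores an editor by the balance of their edits to liberal-category versus conservative-category articles. The function name, the labels, and the simple edit-count formula are assumptions made for this example; the paper's actual measure may differ (for instance, by weighting the size of each contribution).

```python
from collections import Counter


def alignment_score(edit_labels):
    """Hypothetical alignment score in [-1, 1].

    edit_labels: one label per edit, naming the category tree of the
    edited article ("liberal", "conservative", or anything else).
    -1.0 means only liberal-category edits, +1.0 only conservative ones.
    Edits outside both trees are ignored (an assumption of this sketch).
    """
    counts = Counter(edit_labels)
    liberal = counts["liberal"]
    conservative = counts["conservative"]
    total = liberal + conservative
    if total == 0:
        return 0.0  # no political edits: treat the editor as neutral
    return (conservative - liberal) / total
```

An editor with two liberal-category edits and one conservative-category edit would score -1/3 under this formula, placing them slightly to the liberal side of the scale.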

The researchers found that "articles attracting more attention tend to have more balanced engagement from editors along the conservative-liberal spectrum" (p. 4). They then measured the quality of articles using a tool developed by Wikimedia research staff (ORES), and determined that higher political polarization was associated with higher article quality. All this fed into their study goals of exploring the relationship between diversity of political alignment and article quality and bias. Through their statistical analysis, they determined that the quality of articles in Wikipedia improves when editors on both sides of politically polarized issues work together to seek collaborative consensus on topics. While this research was directly focused on politically-related topics, it surfaced both a need for political diversity and for motivated contributors.

(Cf. related earlier coverage: "Being Wikipedian is more important than the political affiliation", "Cross-language study of conflict on Wikipedia")

The study of controversy

"Computing controversy: Formal model and algorithms for detecting controversy on Wikipedia and in search queries"[2]
Reviewed by Barbara Page and Tilman Bayer

This paper presents a "method for automatic detection of controversial articles and categories in Wikipedia", based on three data sources:

  • Ratings submitted by readers via the Article Feedback Tool (AFT) in 2011 and 2012
  • The list at Wikipedia:List of controversial issues (manually maintained by Wikipedia editors)
  • A sample of 512 sections drawn randomly from the talk pages of articles on that list ("Surprisingly, only 19.5% of the sections turned out to be controversial").

The researchers argue that applying a mathematical model to Wikipedia talk page controversies makes it possible to incorporate a 'controversy' metric into web searches. This should give those searching for information on a topic a way to quickly assess whether it is controversial. Wikipedia provides researchers with an accessible historical record of controversial discussions. The authors further describe their work: "[Assessing] the controversy should offer [readers] a chance to see the 'wider picture' rather than letting [them] obtain one-sided views." The authors' conclusions were: "Our approach can be also applied in Wikipedia or other knowledge bases for supporting the detection of controversy and content maintenance. Finally, we believe that our results could be useful for...understanding the complex nature of controversy..."

Students edit but still doubt the value of Wikipedia

"Wikipedia in higher education: Changes in perceived value through content contribution"[3]
Reviewed by Barbara Page

Students are a convenient group to study, especially if being studied is part of the syllabus. The 240 students in this study readily admitted to using Wikipedia as a resource even though they did not consider it to be 'reliable and trustworthy'. Using Wikipedia as a resource does not necessarily encourage content contributions by students. In addition, when the students in this study actually added content, their perceptions of the reliability and usefulness of Wikipedia did not change.

(For coverage of various other papers studying the use and perception of Wikipedia by students, see also our 2017 special issue on Wikipedia in Education)

Researching the research using Wikipedia as a corpus

"Excavating the mother lode of human-generated text: A systematic review of research that uses the Wikipedia corpus"[4]
Reviewed by Barbara Page

The amount of research that uses Wikipedia as a source of data continues to grow and enough scholarly content now exists that systematic reviews are available. Computer science has especially been quick to see the potential of this 'mother lode' and how it can be used to study information retrieval, natural language processing, and ontology building. The reference section in this article itself makes interesting reading if only to appreciate the collection of data sets and other research that exists and continues to expand.

(See also our earlier coverage of literature reviews, some involving the same authors: "A systematic review of the Wikipedia literature", "'Wikipedia in the eyes of its beholders: A systematic review of scholarly research on Wikipedia readers and readership'", "Literature reviews of Wikipedia's inputs, processes, and outputs")

Sneaky editing and masking bias

"Persistent Bias on Wikipedia: Methods and Responses"[5]
Reviewed by Barbara Page

Apparently, Wikipedia editors are not the only ones who have observed biased editing. The author of this research article (already mentioned in a previous issue) used his own article as a case study and example of biased editing. It is no surprise that an editor can 'nominally' follow editing guidelines to maintain their bias. Here is the 'how to' on such behavior:

  • deleting positive material
  • adding negative material
  • using a one-sided selection of sources
  • exaggerating the significance of references and topics

Those who are biased sometimes support their editing even in 'the face of resistance'. This is done by:

  • reverting edits
  • selectively invoking Wikipedia rules
  • overruling (bullying?) resistant editors

When bias is challenged by other editors, the strategies for dealing with it are making complaints, 'mobilizing counterediting', and exposing the bias. The author's stinging conclusion speaks for itself: "It is worthwhile becoming aware of persistent bias and developing ways to counter it in order for Wikipedia to move closer to its goal of providing accurate and balanced information."

Seeking credibility

"Information Fortification: An Online Citation Behavior"[6]
Reviewed by Barbara Page

This study is a rebuttal to a 2005 position paper by Forte (one of the authors) and Bruckman, which had drawn "on Latour’s sociology of science and citation to explain citation in Wikipedia with a focus on credibility seeking". The study associates citing sources with other issues of bias and identifies the patterns used in citing sources to encourage and even fabricate controversy. It was limited to non-scientific topics and used data derived from edit logs, interviews and text analysis. "[I]nformation fortification [is] a concept that explains online citation activity that arises from both naturally occurring and manufactured forms of controversy."

Anti-vandalism on Wikidata

"Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017"[7]
Reviewed by Barbara Page

Vandalism of Wikidata can significantly disrupt use of the data, leading to flaws in analyses that rely on it. Collaborative efforts continue to address these concerns and have included some friendly 'competitions'. Strategies for 'fighting' vandalism at this time include manual review, community feedback, and analyzing reverting patterns. Other 'vandalism' fighting tools are being developed. Of particular interest is the discussion about the effort to use "psychologically motivated features capturing a user’s personality and state of mind..."

Wikipedia's one-way relationships with Reddit and Stack Overflow

"Examining Wikipedia With a Broader Lens: Quantifying the Value of Wikipedia's Relationships with Other Large-Scale Online Communities"[8]
Reviewed by Steve Jankowski

There is a growing body of literature that examines Wikipedia's role in creating value for other websites as part of a media ecosystem. Adding to these studies is the work of Vincent, Johnson & Hecht who examined the bidirectional value created for Reddit and Stack Overflow. Conceptually, the authors distinguished between two sets of metrics to define this value. For Reddit and Stack Overflow, they understood value as being a function of user engagement (score/votes, comments, page views) that is contextualized by potential revenue. For Wikipedia, value is likewise seen as user engagement, characterized by edit count, editors gained, editors retained, and article page views, but is not contextualized by revenue (p.4).

Based on this operationalization of value, the authors assessed the amount of content and links created through associative and causal analyses. They found that Wikipedia provided substantial value to Stack Overflow and Reddit. Most clearly, they illustrated this by explaining how posts containing Wikipedia links gained engagement levels that were estimated to be worth $100K per year (p.2). However, this level of engagement did not operate in the reverse. The authors found "negligible increases" (p.2) to the number of edits and editor signups. Based on these results, the authors observed that the relationship between Wikipedia and the two communities was "one-way", with Wikipedia providing more value than it received in return.

Considering this new direction in studying Wikipedia, there are a number of elements that require commentary. The first is the obvious care the authors displayed in their methods. For example, they were conscious of the need to adjust their analyses to account for the skew of current events, and they reported inter-rater agreement for the qualitative analysis that this required. The second comment is that there is a conceptual mismatch in using revenue as a metric for analyzing value created "between communities", considering that the communities themselves do not receive any profit. Perhaps future research in this area might need greater granularity in the type of relationships, reflecting the differences between community-to-community, owner-to-owner, and community-to-owner.

Despite this terminological slippage, this research adds specific details to Van Dijck's analysis of the social media ecosystem[9] where she described the character of the relationship between Google and Wikipedia within a for-profit context. Likewise, the article provides greater support to conclusions presented in an earlier study conducted by McMahon, Johnson & Hecht.[10] In that paper, Google's usage of Wikipedia content in its Knowledge Graph results was shown to reduce the amount of through traffic when a link to Wikipedia was removed. As the authors of both papers agree, contextualizing Wikipedia as part of an ecosystem is significant for understanding and assessing how external relationships can be adapted to the sustainability of Wikipedia.

A 2015 study confined to the subreddit /r/todayilearned (TIL) found "strong statistical evidence suggesting Reddit threads affect Wikipedia viewership levels in a non-trivial manner", but did not examine effects on editor activity.[11]

Articles receiving the most attention (by editors) overall lack the depth of quality found in featured articles

"Knowledge categorization affects popularity and quality of Wikipedia articles"[12]
Reviewed by FULBERT

This empirical research paper explored how knowledge categorization – common in classification systems within the information sciences – works as a scientific and social process when Wikipedia articles are attended to by editors. Categorization leads to nesting of information under major topics, and the further down a hierarchy, the less editing attention articles appear to garner. Articles higher in the hierarchy are referred to as coarse-grained, and while these receive the most attention, their levels of quality have not been the focus of previous studies.

The researchers analyzed a database dump of the English-language Wikipedia from October 20, 2016, considering all articles that were members of at least one category (n=5,006,601). They defined granularity as the length of the shortest path from the root (main category), which averaged 7.59 across all articles. They then compared granularity to the number of article edits (which related to perception of higher quality articles), the importance ratings of articles (assigned individually by WikiProjects), perceptions of quality (based on being classified as a featured article), and the notion of return on effort (quality of an article relative to the amount of work done on it by editors). They conducted non-parametric and parametric statistical analyses using numerous variables drawn from the many article records in their data dump.
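The shortest-path definition of granularity can be sketched with a breadth-first search over the category graph. The toy graph, the function names, and the convention of counting one extra hop from a category to its member article are assumptions for illustration, not details taken from the paper.

```python
from collections import deque


def category_depths(subcategories, root):
    """Shortest-path distance from root to every reachable category (BFS).

    subcategories: dict mapping a category to a list of its subcategories.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in subcategories.get(cat, ()):
            if child not in depth:  # first visit is the shortest path
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth


def article_granularity(article_categories, depth):
    """Shortest path from the root to an article.

    Assumption of this sketch: the minimum depth over the article's
    categories, plus one hop from the category to the article itself.
    Returns None if none of the categories is reachable from the root.
    """
    reachable = [depth[c] for c in article_categories if c in depth]
    return min(reachable) + 1 if reachable else None


# Hypothetical toy category graph; "Optics" is reachable by two routes.
toy_graph = {
    "Main": ["Science", "Culture"],
    "Science": ["Physics"],
    "Physics": ["Optics"],
    "Culture": ["Optics"],
}
depths = category_depths(toy_graph, "Main")
```

In this toy graph, "Optics" sits at depth 2 because the route through "Culture" is shorter than the one through "Science" and "Physics"; a lower value corresponds to a more coarse-grained article in the review's terms.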

There were many levels of findings, with the main one being that articles in coarse-grained categories (those nearest the top of the hierarchies) received the greatest number of edits and the most attention from editors, though they were least likely to be featured (highest quality) articles. This seemed to surprise the authors, as it means that those articles that receive the most attention (by editors) overall lack the depth of quality found in featured articles, most of which are further down the hierarchy.


"Mean number of edits is displayed in the x-axis. The linear regression coefficient α1 of the granularity variable explaining the number of edits [...] is displayed in the y-axis. Area of points is proportional to the number of articles in the respective top-level category." (Figure 6 from the paper)


"The baseline probability of featured articles in the respective TLC [top-level category] is displayed in the x-axis. The logistic regression coefficient of the granularity variable, when controlling for the number of edits [...], is displayed in the y-axis." (Figure 7 from the paper)

Conferences and events

See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.

Compiled by Barbara (WVS) and Tilman Bayer
  • "Time-focused analysis of connectivity and popularity of historical persons in Wikipedia"[13] From the abstract: "Our study sheds new light on the characteristics of information about historical people recorded in the English Wikipedia and quantifies user interest in such data. We propose a novel style of analysis in which we use signals derived from the hyperlink structure of Wikipedia as well as from article view logs, and we overlay them over temporal dimension to understand relations between time periods, link structure and article popularity."
  • "Patients are content with Dr. Google"[14] (Article in German, translated title: "Health information: He who searches, will find – Patients are content with Dr. Google") 72% of patients in Germany consult "Wikipedia and other online encyclopedias" for health information. 54% find Wikipedia "trustworthy".
  • "WikiLyzer: Interactive Information Quality Assessment in Wikipedia"[15] From the abstract: "We developed WikiLyzer, a toolkit comprising three Web-based interactive graphic tools designed to assist (i) knowledge discovery experts in creating and testing metrics for quality measurement , (ii) users searching for good articles, and (iii) users that need to identify weaknesses to improve a particular article. A case study suggests that experts are able to create complex quality metrics with our tool and a report in a user study on its usefulness to identify high-quality content."
  • "To link or not to link: Ranking hyperlinks in Wikipedia using collective attention"[16] From the abstract: "... we tackle overlinking in Wikipedia as a ranking problem. We apply Learning to Rank algorithms to evaluate the click frequency of links in an effort to distinguish the most useful links for users. To accomplish this, we develop a ground truth, which serves as baseline for our algorithm and compare hyperlink features to implement the most advantageous ones. The results show 86.2% accuracy with the top-6 most useful features and 87.7% accuracy with the complete feature set. Considering these results, we outline a solution to the overlinking problem. By removing the most inadequate links, we suggest that readability of Wikipedia articles could be improved while preserving most of its useful links."
  • "Usage of Wikipedia by health science and social sciences & humanities undergraduates of University of Peradeniya and SouthEastern University of Sri Lanka"[17] From the paper: "[Survey] participants were given five options to indicate how they use information services and technologies in searching for their study requirements. The study show that 79% of total students access Wikipedia and 78% use Google to find their information followed by Google scholar (60%), library (57%), and other databases 43% . 69% of FAC and 67% of FAHS students mentioned Wikipedia as their first source of information while embarking into searching for information. 79% students of both faculties use Wikipedia as the first source, while 69% use it as the only source. [...] According to the outcome of the study, Wikipedia was crowned as the first source and most common source when undergraduates seek for information."
  • "Emo, Love, and God: Making Sense of Urban Dictionary, a Crowd-Sourced Online Dictionary [Wiktionary]"[18] Conclusions: "The lexical content of UD [Urban Dictionary] is radically different from that of Wiktionary, another crowd-sourced, but more highly moderated dictionary. In general, we can say that the overlap between the two dictionaries is small. Considering all unique UD headwords that are not found in Wiktionary, we found that this number is almost three times the number of headwords that uniquely occur in Wiktionary. However, if we exclude words with only one definition in UD (which tend to be infrequent or idiosyncratic words), we found the opposite pattern, with Wiktionary-only headwords amounting to almost three times the UD-only headwords."

References

  1. Shi, Feng; Teplitskiy, Misha; Duede, Eamon; Evans, James (2017-11-29). "The Wisdom of Polarized Crowds". arXiv:1712.06414 [cs.SI]. 
  2. Zielinski, Kazimierz; Nielek, Radoslaw; Wierzbicki, Adam; Jatowt, Adam (2018). "Computing controversy: Formal model and algorithms for detecting controversy on Wikipedia and in search queries". Information Processing & Management 54 (1): 14–36. doi:10.1016/j.ipm.2017.08.005. 
  3. Soler-Adillon, Joan; Pavlovic, Dragana; Freixa, Pere (2018). "Wikipedia in higher education: Changes in perceived value through content contribution". Comunicar (in Spanish) 26 (54): 39–48. ISSN 1134-3478. doi:10.3916/c54-2018-04. (English version available)
  4. Mehdi, Mohamad; Okoli, Chitu; Mesgari, Mostafa; Nielsen, Finn Årup; Lanamäki, Arto (2017). "Excavating the mother lode of human-generated text: A systematic review of research that uses the wikipedia corpus" (PDF). Information Processing & Management 53 (2): 505–529. doi:10.1016/j.ipm.2016.07.003. 
  5. Martin, Brian (2017). "Persistent Bias on Wikipedia: Methods and Responses". Social Science Computer Review: 089443931771543. doi:10.1177/0894439317715434. (closed access; author's copy available)
  6. Forte, Andrea; Andalibi, Nazanin; Gorichanaz, Tim; Kim, Meen Chul; Park, Thomas; Halfaker, Aaron (2018-01-07). "Information Fortification: An Online Citation Behavior" (PDF). Proceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP 2018, Sanibel Island, FL, USA, January 07-10, 2018. ACM. pp. 83–92. ISBN 9781450355629. doi:10.1145/3148330.3148347. 
  7. Heindorf, Stefan; Potthast, Martin; Engels, Gregor; Stein, Benno (2017). "Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017". arXiv:1712.05956 [cs.IR]. 
  8. Vincent, Nicholas; Johnson, Isaac; Hecht, Brent (2018-04-21). "Examining Wikipedia With a Broader Lens: Quantifying the Value of Wikipedia's Relationships with Other Large-Scale Online Communities" (PDF). CHI 2018. Montréal, QC, Canada: Association for Computing Machinery. 
  9. van Dijck, José (2013). The culture of connectivity: a critical history of social media. Oxford New York: Oxford University Press. Chapter 7.4. ISBN 9780199970780. 
  10. McMahon, Connor; Johnson, Isaac; Hecht, Brent (2017). "The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies". Eleventh International AAAI Conference on Web and Social Media. AAAI. pp. 142–151. 
  11. Carson, S. L.; Dye, T. K.; Goldbaum, D.; Moyer, D.; Carson, R. T. (2015). "Determining the influence of Reddit posts on Wikipedia pageviews". Wikipedia, a Social Pedia: Research Challenges and Opportunities: Papers from the 2015 ICWSM Workshop. ICWSM 2015. Association for the Advancement of Artificial Intelligence. pp. 75–82. ISBN 9781577357377. 
  12. Lerner, Jürgen; Lomi, Alessandro (2018-01-02). "Knowledge categorization affects popularity and quality of Wikipedia articles". PLOS One 13 (1): 1–22. Bibcode:2018PLoSO..1390674L. ISSN 1932-6203. doi:10.1371/journal.pone.0190674. 
  13. Jatowt, Adam; Kawai, Daisuke; Tanaka, Katsumi (2018-02-08). "Time-focused analysis of connectivity and popularity of historical persons in Wikipedia". International Journal on Digital Libraries: 1–19. ISSN 1432-5012. doi:10.1007/s00799-018-0231-4. (closed access)
  14. "Gesundheitsinfos: Wer suchet, der findet – Patienten mit Dr. Google zufrieden". Spotlight Gesundheit, Bertelsmann Stiftung. 2018. 
  15. di Sciascio, Cecilia; Strohmaier, David; Errecalde, Marcelo; Veas, Eduardo (2017). "WikiLyzer: Interactive Information Quality Assessment in Wikipedia". IUI '17. New York, NY, USA: ACM. pp. 377–388. ISBN 9781450343480. doi:10.1145/3025171.3025201. 
  16. Thruesen, P.; Čechák, J.; Sezñec, B.; Castalio, R.; Kanhabua, N. (December 2016). "To link or not to link: Ranking hyperlinks in Wikipedia using collective attention". 2016 IEEE International Conference on Big Data (Big Data). pp. 1709–1718. doi:10.1109/BigData.2016.7840785. (closed access)
  17. Dehigama, Kanchana; Jazeel, M. I. M. (2017-12-07). "Usage of Wikipedia by health science and social sciences & humanities undergraduates of University of Peradeniya and SouthEastern University of Sri Lanka". 
  18. Nguyen, Dong; McGillivray, Barbara; Yasseri, Taha (2017-12-22). "Emo, Love, and God: Making Sense of Urban Dictionary, a Crowd-Sourced Online Dictionary". arXiv:1712.08647 [cs.CL]. 

