Wikimedia Research Newsletter

Vol: 4 • Issue: 1 • January 2014

Translation assignments, weasel words, and Wikipedia's content in its later years

With contributions by: Aaron Halfaker, Jonathan Morgan, Piotr Konieczny and Tilman Bayer

Translation students embrace Wikipedia assignments, but find user interface frustrating

An article, "Translating Wikipedia Articles: A Preliminary Report on Authentic Translation Projects in Formal Translator Training", [1] reports on the author's experiment with "a promising type of assignment in formal translator training which involves translating and publishing Wikipedia articles", in three courses with second- and third-year students at the Institute of English Studies, University of Warsaw.

It was "enthusiastically embraced by the trainees ... Practically all of the respondents [in a participant survey] concluded that the experience was either 'positive' (31 people, 56% of the respondents) or 'very positive' (23 people, 42% of the respondents)." And "more than 90% of the respondents (50 people) recommended that the exercise 'should definitely be kept [in future courses], maybe with some improvements,' and the remaining 5 people (9%) cautioned that improvements to the format were needed before it was used again. No-one recommended culling the exercise from the syllabus."

However, the author cautions that Polish–English translations required more instructor feedback and editing than translations from English into Polish (the students' native language). And "most people found the technological aspects of the assignment frustrating, with most students assessing them as either 'hard' (39%) or 'very hard' (16%) to complete. The technical skills involved not only coding and formatting using Wikipedia's idiosyncratic syntax, but the practical aspects of publication. [Asked] to identify areas requiring better assistance, the respondents predominantly focused on the need for better information on coding/formatting the article and on publishing the entry. Thirty-nine people (almost three-quarters of the respondents) found the publication criteria baffling enough to postulate that more assistance was needed. That is even more than the 36 people (68%) who had problems dealing with Wikipedia's admittedly idiosyncratic code."

In the researcher's observation, this contributed to the initially disappointing success rate: "Of the 59 respondents, only eight had their work accepted [after drafting it in a sandbox]. Seven people were asked to revise their entries to bring them into line with Wikipedia's publication guidelines but neglected to do so, and 36 did not even try to publish. Some of those people were still waiting for their feedback to get a green light, but this result can only be described as a big disappointment. ... After a resource pack on how to translate and publish a Wikipedia entry was distributed to a fresh batch of students in the following semester, the successful publication rate proved significantly higher." These English-language instructions are humorously written in the form of a game manual ("Your mission is to create a Polish translation of an English-language article and deliver it safely to the Free Encyclopaedia HQ officially known as 'Wikipedia'. Sounds easy? Think again. Wikipedia is defended by an army of Editors who guard its gates night and day to stop Lord Factoid and his minions from corrupting it with bad articles."). They are available on the author's website, together with a small list of the resulting articles (which is absent from the actual research paper).

The project was inspired by author Cory Doctorow's use of Wikipedia in a 2009 course (most likely the one listed here, although the paper fails to specify it). The paper neither discusses Wikipedia policies nor cites prior research on Wikipedia in education, which makes it almost certain that the author was unaware of those policies and of the available support (Wikipedia Education Program, etc.).


Why bots should be regarded as an integral part of Wikipedia's software platform

In a new paper titled "Bots, bespoke code, and the materiality of software platforms"[2] published in Information, Communication & Society, Stuart Geiger (User:Staeiou) presents a critical reflection on the common view of online communities as sovereign platforms governed by code, using Wikipedia as an example. He borrows the term "bespoke" to refer to code that affects the social dynamics of a community, but is designed and owned separately from the software platform (e.g. Wikipedia bots). Geiger mixes vignettes describing his personal experience running en:User:AfDStatBot with discussions of the related literature (including Lessig's famous "code is law") to advocate "examining online communities as both governed by stock and bespoke code, or else we will miss important characteristics of mediated interaction."

"Precise and efficient attribution of authorship of revisioned content"

Using a graph-theoretic approach, Flöck and Acosta investigate[3] a new algorithm that can detect the author of each part of a document that has been edited by many people. They use a units-of-discourse model to identify paragraphs, sentences and words, and the connections between them. The authors claim that this approach can identify an author with 95% precision, which is higher than the current state of the art. Most intriguing is that, to make this comparison, they created the first "gold standard", a hand-made benchmark of 240 Wikipedia pages and their complex authorship histories.
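To make the attribution task concrete, here is a minimal Python sketch of a much simpler baseline than the one described in the paper. It diffs consecutive revisions word by word with the standard library's `difflib` and lets unchanged words keep their earlier author. The function name `attribute_tokens` and the sample revisions are hypothetical, and this is not the authors' graph-based algorithm.

```python
from difflib import SequenceMatcher


def attribute_tokens(revisions):
    """Attribute each word of the last revision to an author.

    revisions: list of (author, text) pairs in chronological order.
    Returns a list of (word, author) pairs for the final revision.
    Illustrative baseline only: words are matched by a plain sequence diff.
    """
    tokens, authors = [], []
    for author, text in revisions:
        new_tokens = text.split()
        # By default, every word is credited to the author of this revision.
        new_authors = [author] * len(new_tokens)
        matcher = SequenceMatcher(a=tokens, b=new_tokens, autojunk=False)
        # Words that survive unchanged from the previous revision keep
        # whatever author they already had.
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_authors[block.b + k] = authors[block.a + k]
        tokens, authors = new_tokens, new_authors
    return list(zip(tokens, authors))


# Hypothetical two-revision history.
history = [
    ("alice", "the cat sat"),
    ("bob", "the black cat sat down"),
]
for word, author in attribute_tokens(history):
    print(word, author)
```

A diff-only baseline like this credits text that was deleted and later restored (for example after a revert) to whoever restored it, which is one way such baselines can misattribute content on pages with complex histories.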

"Which news organizations influence Wikipedia?"

This is the question asked in a blog post[4] by a postdoctoral researcher at Columbia University's Tow Center for Digital Journalism. Looking at the top 10 news stories of 2013 (an admittedly subjective set determined by the author), the post analyzes which organizations the citations come from. Leading the pack are the New York Times, Washington Post and CNN, but the author notes that the tail of the distribution is very long: 68% of citations do not come from the top 10 organizations. The qualitative analysis discusses "the surprise for the news organizations that don’t make the top ten; CBS News, ABC News, FOX News [...] this top ten strikes as leaning left overall".
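A tally of this kind can be sketched in a few lines of Python. The snippet below is an assumption about how such a count might be done, not the blog author's actual method: it groups citation URLs by host name and reports the share of citations that fall outside the top N sources. The helper names `host_of` and `tail_share` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host name of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def tail_share(urls, top_n=10):
    """Return (top sources with counts, share of citations outside the top)."""
    if not urls:
        raise ValueError("need at least one citation URL")
    counts = Counter(host_of(u) for u in urls)
    top = counts.most_common(top_n)
    top_total = sum(n for _, n in top)
    return top, 1 - top_total / len(urls)


# Hypothetical citation URLs, for illustration only.
sample = [
    "https://www.nytimes.com/a",
    "https://nytimes.com/b",
    "http://www.cnn.com/c",
    "https://example.org/d",
]
top, tail = tail_share(sample, top_n=1)
print(top, tail)
```

Grouping by host name is a rough proxy for "organization": one publisher may use several domains, so a real analysis would probably need a manual mapping on top of this.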

Weasels, hedges, and peacocks in Wikipedia articles

Some computational linguists find many Wikipedia articles to be a superlative corpus for natural language processing applications. Weasel words, hedges, and peacock terms (like the ones in the previous sentence) are labelled by Wikipedia editors because they tend to make an article less objective. A recent study[5] leverages this work to understand general features of the way people use subjective language to increase uncertainty about the truth or authority of the statements they make. By examining a set of 200 Wikipedia articles that had been flagged for these terms, the researchers found 899 different keywords that were frequently used as peacock terms, weasel words, and hedges. A machine learning classifier trained on this set of keywords was able to identify other (unlabeled) articles written in a subjective manner with high accuracy. In the future, approaches like these could lead to better automated detection of inappropriately subjective or unsourced statements, not only in Wikipedia articles, but also in news articles, scientific papers, product reviews, search results, and other scenarios where people need to be able to trust that the information they are reading is credible.
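As a rough illustration of keyword-based detection, the following Python sketch flags cue words in a sentence using a tiny hand-made lexicon. Both the lexicon `UNCERTAINTY_CUES` (including which label each cue gets) and the function `find_cues` are hypothetical. The study's 899 keywords are not reproduced here, and this plain lookup stands in for, rather than implements, the trained classifier.

```python
import re

# Hypothetical mini-lexicon for illustration; NOT the study's keyword list.
UNCERTAINTY_CUES = {
    "weasel": ["some people say", "it is believed", "many", "some"],
    "hedge": ["possibly", "perhaps", "may", "might"],
    "peacock": ["superlative", "legendary", "world-class"],
}


def find_cues(sentence):
    """Return (label, cue) pairs for every lexicon cue found in the sentence."""
    lowered = sentence.lower()
    hits = []
    for label, cues in UNCERTAINTY_CUES.items():
        for cue in cues:
            # Word boundaries keep "may" from matching inside "many".
            if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
                hits.append((label, cue))
    return hits


sentence = ("Some computational linguists find many Wikipedia articles "
            "to be a superlative corpus.")
print(find_cues(sentence))
```

A lookup like this cannot tell a weasel-word "some" from a harmless one ("some 200 articles"), which is presumably where a classifier trained on labelled examples earns its keep.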

WikiSym/OpenSym call for submissions

The call for submissions (until April 20) to this year's WikiSym/OpenSym conference lists 15 research topics of interest in the Wikipedia research track. The conference has taken place annually since 2005; this year's instance will take place from August 27–29, 2014 in Berlin, Germany. As in preceding years, the organizers intend to apply for financial support from the Wikimedia Foundation, addressing the open access concerns voiced in previous years with a reference to a new policy of ACM, the publisher of the proceedings.

Gender imbalance in Wikipedia coverage of academics to be studied with 2-year NSF grant

Sociologists Hannah Brückner (New York University Abu Dhabi) and Julia Adams (Yale University) have received a two-year grant of over US$132,000 from the National Science Foundation for a research project titled "Collaborative Research: Wikipedia and the Democratization of Academic Knowledge". As described in a press release this month, the project will study "the way gender bias affects the development of pages for American academics in the fields of computer science, history, and sociology, disciplines that vary in their gender composition. ... For instance, 80 percent of academics listed on the Wikipedia page American Sociologists are male, while in reality less than 60 percent of American sociologists are male." The researchers plan to create lists of academics in each field who satisfy the notability criteria for academics, and compare them with the actual coverage on Wikipedia.

Discussions about accessibility studied

A paper presented at last year's SIGCHI Conference on Human Factors in Computing Systems (CHI'13)[6] examines the English Wikipedia in one of its "case studies of two UGC communities with accessible content". Starting from uses of Template:AccessibilityDispute and pages related to Wikipedia:WikiProject_Accessibility, the authors "identified 179 accessibility discussions involving 82 contributors" and coded them according to content and other aspects.

Wikipedia content "still growing substantially even in later years"

A preprint[7] by two researchers from Stanford University and the London School of Economics analyzes the history of around 1500 pages in the English Wikipedia's Category:Roman Empire over eight years, providing descriptive statistics for 77,671 (non-bot) edits for articles in that category. The authors find that "content is still growing substantially even in later years. Less new pages are created over time, but at the page-level we see very little slow-down in activity." They identify a "key driver of content growth which is a spill-over effect of past edits on current editing activity" – that is, articles that have been edited more often in the past attract more editing activity in the future, even when controlling for factors such as the page's "inherent popularity", suggesting a causal relationship.

Discover "winning arguments" in article histories, and notify losing editors

Winning the best paper award at last year's European Semantic Web Conference (ESWC), three authors from the French research institute INRIA presented (video)[8] "a framework to support community managers in managing argumentative discussions on wiki-like platforms. In particular, our approach proposes to automatically detect the natural language arguments and the relations among them, i.e., support or challenges, and then to organize the detected arguments in bipolar argumentation frameworks." Specifically, they analyzed the revision history of the five most revised pages on the English Wikipedia at one point (e.g. George W. Bush), extracting sentences that were heavily edited over time while still describing the same event. To these "arguments" they apply an NLP technique known as textual entailment (basically, detecting whether the assertion of the new version of the sentence logically follows from the first version, or whether the first version was "attacked" by a subsequent editor by deleting or correcting some of the information). The paper focuses mostly on establishing and testing this methodology, without detailing the actual results derived from the five revision histories (i.e. which arguments actually won in those cases), but the authors promise that "this kind of representation helps community managers to understand the overall structure of the discussions and which are the winning arguments." Also, they point out that it should make it possible to "notify the users when their own arguments are attacked."
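The idea of organizing arguments by support and attack relations can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' framework: the class `BipolarFramework` and its methods are hypothetical, the relations are entered by hand rather than detected by textual entailment, and "winning" arguments are computed with a standard grounded labelling over the attack relation only (support edges are stored but ignored in that computation).

```python
from dataclasses import dataclass, field


@dataclass
class BipolarFramework:
    """Arguments linked by attack and support relations (illustrative only)."""
    arguments: set = field(default_factory=set)
    attacks: set = field(default_factory=set)    # (attacker, target) pairs
    supports: set = field(default_factory=set)   # (supporter, target) pairs

    def add_relation(self, source, target, kind):
        self.arguments.update({source, target})
        if kind == "attack":
            self.attacks.add((source, target))
        elif kind == "support":
            self.supports.add((source, target))
        else:
            raise ValueError("kind must be 'attack' or 'support'")

    def grounded_accepted(self):
        """Grounded labelling over attacks; support edges are not used here."""
        accepted, rejected = set(), set()
        changed = True
        while changed:
            changed = False
            for arg in self.arguments - accepted - rejected:
                attackers = {a for a, t in self.attacks if t == arg}
                if attackers <= rejected:
                    # Every attacker is already defeated (or there is none).
                    accepted.add(arg)
                    changed = True
                elif attackers & accepted:
                    # At least one accepted argument attacks this one.
                    rejected.add(arg)
                    changed = True
        return accepted


# Hypothetical sentence versions v1..v4 from one revision history.
fw = BipolarFramework()
fw.add_relation("v2", "v1", "attack")   # v2 corrected information in v1
fw.add_relation("v3", "v2", "attack")   # v3 in turn corrected v2
fw.add_relation("v4", "v3", "support")  # v4 is consistent with v3
print(sorted(fw.grounded_accepted()))
```

In this toy history, v1 counts as accepted again because its only attacker, v2, is itself defeated by v3. Arguments caught in a mutual attack are neither accepted nor rejected under this labelling.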


  1. Piotr Szymczak: Translating Wikipedia Articles: A Preliminary Report on Authentic Translation Projects in Formal Translator Training. In: Acta Philologica 44 (Warszawa 2013) p.61ff
  2. Geiger, R. Stuart. "Bots, bespoke code, and the materiality of software platforms". Information, Communication & Society: 1–15. ISSN 1369-118X. doi:10.1080/1369118X.2013.873069.  Closed access, author's copy at
  3. Fabian Flöck, Maribel Acosta: WikiWho: Precise and Efficient Attribution of Authorship of Revisioned Content.
  4. Fergus Pitt: Which News Organizations Influence Wikipedia? January 17, 2014.
  5. Vincze, Veronika: Weasels, Hedges and Peacocks: Discourse-level Uncertainty in Wikipedia Articles.
  6. Kuksenok, Katie; Brooks, Michael; Mankoff, Jennifer (2013). "Accessible Online Content Creation by End Users". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '13. New York City: ACM. pp. 59–68. ISBN 978-1-4503-1899-0. doi:10.1145/2470654.2470664. 
  7. Aleksi Aaltonen, Stephan Seiler: Cumulative Knowledge and Open Source Content Growth: The Case of Wikipedia
  8. Elena Cabrio, Serena Villata, and Fabien Gandon: A Support Framework for Argumentative Discussions Management in the Web.

This newsletter is brought to you by the Wikimedia Research Committee and The Signpost