Research:Newsletter/2022/May

Vol: 12 • Issue: 05 • May 2022

35 million Twitter links analysed

"TWikiL - the Twitter Wikipedia Link Dataset"

This paper,^[1] to be presented at next month's ICWSM conference, provides a dataset containing "all Wikipedia links posted on Twitter in the period 2006 to January 2021" - 35,252,782 URLs altogether, from 34,543,612 unique tweets. While framed as a dataset paper designed to enable future research, it also reports various exploratory data analysis results, for example on the distribution of links across Wikipedia languages:

"more than half of all links posted on Twitter (54%) are taken from the English language version. Links from the Japanese version account with 24% for the second highest share followed by Spanish, German and French."

The author notes that the Dutch Wikipedia received a high number of Twitter links relative to its share of pageviews.
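As a rough illustration of how such per-language shares can be tallied, the following minimal sketch derives the language edition from each URL's hostname and counts the results. It is not the author's code: the function names and the sample URLs are made up for illustration, and a real pipeline would also have to handle shortened links and redirects.

```python
from collections import Counter
from urllib.parse import urlparse


def wikipedia_language(url):
    """Return the language subdomain of a Wikipedia URL, or None if it is not one."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Accept hosts such as en.wikipedia.org and en.m.wikipedia.org (mobile),
    # but not the language-neutral portal at www.wikipedia.org.
    if len(parts) >= 3 and parts[-2:] == ["wikipedia", "org"] and parts[0] != "www":
        return parts[0]
    return None


def language_shares(urls):
    """Map each language code to its fraction of all recognised Wikipedia links."""
    counts = Counter(lang for lang in map(wikipedia_language, urls) if lang)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.most_common()}


# Made-up sample URLs, not taken from the dataset.
sample_urls = [
    "https://en.wikipedia.org/wiki/Biathlon",
    "https://en.m.wikipedia.org/wiki/BTS",
    "https://ja.wikipedia.org/wiki/Twitter",
    "https://example.org/not-wikipedia",
]
shares = language_shares(sample_urls)
```

Dividing such link shares by each edition's share of pageviews would give the kind of relative comparison mentioned above for the Dutch Wikipedia.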

Analysing the linked articles by topic category (relying on the language-agnostic automated ORES article topic classification rather than Wikipedia categories), the author finds that

"The ranking of article meta categories from most frequent to least frequent is Culture, Geography, STEM and History & Society and this ranking does not change radically through the years. The popularity of Culture might be traced back to biography links which account for 21.3% of all linked items ..."

Concerning the popularity of concepts across languages (i.e. Wikidata items), the author reports that

"more than half of all concepts were only posted once and that the distribution is again highly skewed. Among the top five most popular concepts we do not find historical figures or events as one could expect, but two boy bands, the South Korean boy band Bangtan Boys (BTS) and the Filipino boy band SoundBreak 19 (SB19). While being among the most linked concepts they still account only for a very small percentage ..."

It is worth bearing in mind that Twitter links account for only a small percentage of the traffic that Wikipedia receives from external referrers, where search engines dominate. Moreover, in the weekly list of English Wikipedia articles receiving the most social media traffic, which the Wikimedia Foundation published until the end of last year, Twitter seems to have appeared less often as a referrer than Reddit or Facebook.

Briefly

See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Fact-checking dialogue responses using Wikipedia

From the abstract:^[2]

"We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DIALFACT, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DIALFACT: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information."

As an example dialogue where the (itself automatically generated) response was successfully refuted by the automatically retrieved evidence snippet from Wikipedia, the authors offer the following:

Context: "Biathlon means two sports right? What is the other sport?"
Response: "Biathlon combine the two sports into one event called the cross country ski race. It’s a lot of fun!"
Evidence from the article Biathlon: "The biathlon is a winter sport that combines cross-country skiing and rifle shooting."

"Wikidata Completeness Profiling Using ProWD"

From the abstract:^[3]

"... we present ProWD, a framework and tool for profiling the completeness of Wikidata [...]. ProWD measures the degree of completeness based on the Class-Facet-Attribute (CFA) profiles. A class denotes a collection of entities, which can be of multiple facets, allowing attribute completeness to be analyzed and compared, e.g., how does the completeness of the attribute "educated at" and "date of birth" compare between male, German computer scientists, and female, Indonesian computer scientists? ProWD generates summaries and visualizations for such analysis, giving insights into the KG [ knowledge graph] completeness."
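To make the Class-Facet-Attribute idea more concrete, here is a minimal sketch of the kind of comparison the abstract describes. It is not ProWD's implementation: the function `attribute_completeness`, the field names and the toy records are all invented for illustration, and the entities are held in memory rather than queried from Wikidata.

```python
def attribute_completeness(entities, facets, attributes):
    """Share of entities matching `facets` that have a value for each attribute."""
    matching = [
        e for e in entities
        if all(e.get(key) == value for key, value in facets.items())
    ]
    if not matching:
        # No entity has this facet combination, so completeness is undefined.
        return {attr: None for attr in attributes}
    return {
        attr: sum(1 for e in matching if e.get(attr) is not None) / len(matching)
        for attr in attributes
    }


# Invented toy records standing in for entities of one class (computer scientists).
computer_scientists = [
    {"gender": "male", "citizenship": "Germany",
     "educated_at": "University A", "date_of_birth": "1970-01-01"},
    {"gender": "male", "citizenship": "Germany",
     "educated_at": None, "date_of_birth": "1982-05-17"},
    {"gender": "female", "citizenship": "Indonesia",
     "educated_at": "University B", "date_of_birth": None},
]

attrs = ["educated_at", "date_of_birth"]
profile_male_german = attribute_completeness(
    computer_scientists, {"gender": "male", "citizenship": "Germany"}, attrs
)
profile_female_indonesian = attribute_completeness(
    computer_scientists, {"gender": "female", "citizenship": "Indonesia"}, attrs
)
```

Here the class is fixed by the choice of input list, the facets are the filter values, and the two resulting profiles can be compared attribute by attribute.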

"Representation and the problem of bibliographic imagination on Wikipedia"

From the abstract:^[4]

"The author used an extended example, the Wikipedia article on the Philippine–American War, to illustrate the unfortunate effects that accompany a lack of attention to the kind of sources used to produce narratives for the online encyclopaedia. [...]

Findings

Inattention to sources (a lack of bibliographical imagination) produces representational anomalies. Certain sources are privileged when they should not be and others are ignored or considered as sub-standard. Overall, the epistemological boundaries of the article in terms of what the editorial community considers reliable and what the community of scholars producing knowledge about the war think as reliable do not overlap to the extent that they should."

See also our coverage of earlier papers by the same author.

"WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia"

From the abstract:^[5]

"... we propose a task of detecting self-contradiction articles in Wikipedia. Based on the "self-contradictory" template, we create a novel dataset for the self-contradiction detection task. Conventional contradiction detection focuses on comparing pairs of sentences or claims, but self-contradiction detection needs to further reason the semantics of an article and simultaneously learn the contradiction-aware comparison from all pairs of sentences. Therefore, we present the first model, Pairwise Contradiction Neural Network (PCNN), to not only effectively identify self-contradiction articles, but also highlight the most contradiction pairs of contradiction sentences. [...] Experiments conducted on the proposed WikiContradiction dataset exhibit that PCNN can generate promising performance and comprehensively highlight the sentence pairs the contradiction locates."

As an example of a pair of contradictory sentences that were detected successfully (i.e. in accordance with the ground truth), the paper offers the following from the article The Silent Scream (1979 film):

"The film was released theatrically by American Cinema Releasing in limited theaters in November 23, 1979 in Victor, Texas, and in January 30, 1980 in Bismarck, North Dakota."
"The film is released 1980/1/18" [sic]
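The all-pairs comparison that the abstract contrasts with conventional sentence-pair contradiction detection can be sketched as follows. This is emphatically not PCNN: the learned model is replaced here by a crude, hypothetical heuristic (`toy_contradiction_score`) that flags sentence pairs with overlapping wording but different years, and the example sentences are invented. The sketch only shows the enumerate-and-rank structure.

```python
import re
from itertools import combinations

YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")
WORD = re.compile(r"[a-z]+")


def toy_contradiction_score(a, b):
    """Hypothetical stand-in for a learned scorer.

    Returns the word overlap (Jaccard) of two sentences if both mention
    a year and they have no year in common, otherwise 0.0.
    """
    years_a, years_b = set(YEAR.findall(a)), set(YEAR.findall(b))
    if not years_a or not years_b or years_a & years_b:
        return 0.0
    words_a = set(WORD.findall(a.lower()))
    words_b = set(WORD.findall(b.lower()))
    return len(words_a & words_b) / len(words_a | words_b)


def rank_sentence_pairs(sentences, score=toy_contradiction_score):
    """Score every unordered sentence pair and sort by descending score."""
    scored = [
        (score(sentences[i], sentences[j]), i, j)
        for i, j in combinations(range(len(sentences)), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Invented example article, not taken from the WikiContradiction dataset.
article = [
    "The bridge opened in 1932.",
    "It is 500 metres long.",
    "The bridge opened to traffic in 1935.",
]
ranked = rank_sentence_pairs(article)
```

The number of pairs grows quadratically with article length, which is one reason the task is harder than comparing a single given pair of claims.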

References

[1] Meier, Florian (2022-05-05). "TWikiL -- The Twitter Wikipedia Link Dataset". arXiv. Published version to appear in Proceedings of ICWSM 2022 (dataset, code).
[2] Gupta, Prakhar; Wu, Chien-Sheng; Liu, Wenhao; Xiong, Caiming (2022-03-24). "DialFact: A Benchmark for Fact-Checking in Dialogue". arXiv.
[3] Wisesa, Avicenna; Darari, Fariz; Krisnadhi, Adila; Nutt, Werner; Razniewski, Simon (2019-09-23). "Wikidata Completeness Profiling Using ProWD". Proceedings of the 10th International Conference on Knowledge Capture. K-CAP '19. New York, NY, USA: Association for Computing Machinery. pp. 123–130. ISBN 9781450370080. doi:10.1145/3360901.3364425. Author's copy (thesis, code).
[4] Luyt, Brendan (2021-01-01). "Representation and the problem of bibliographic imagination on Wikipedia". Journal of Documentation. ISSN 0022-0418. doi:10.1108/JD-08-2021-0153.
[5] Hsu, Cheng; Li, Cheng-Te; Saez-Trumper, Diego; Hsu, Yi-Zhan (December 2021). "WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia". 2021 IEEE International Conference on Big Data (Big Data). pp. 427–436. doi:10.1109/BigData52589.2021.9671319. Preprint version: Hsu, Cheng; Li, Cheng-Te; Saez-Trumper, Diego; Hsu, Yi-Zhan (2021-11-16). "WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia". arXiv:2111.08543 [cs]. (code and data)

Wikimedia Research Newsletter