User:Ecpp/Sandbox

From Meta, a Wikimedia project coordination wiki

Assessing Data visualization quality on Wikipedia and Commons

Introduction

Since its beginning, Wikipedia has been the subject of numerous academic studies. Some of them focused on methodologies for assessing the quality of a Wikipedia article (Lih 2004; Blumenstock 2008; Stvilia et al. 2008) in order to determine the reliability of textual information. Others focused on the visual side of Wikipedia, studying how the communities of Wikipedia and Wikimedia Commons create and manage images, graphs, and maps (Viégas 2007; Mauri, Pini and Ciuccarelli 2017). This research shifts the attention from textual to visual information by investigating the quality of data visualizations on Wikipedia and Commons. Seven qualitative parameters were formulated to rate the quality of the data visualizations uploaded to Commons and used in Wikipedia articles, along with two quantitative parameters. Ultimately, this research is intended as a starting point for further investigations into the use of data visualization to illustrate Wikipedia articles.


Approach

After establishing a definition of data visualization, I searched for other academic research on data visualization and Wikipedia. During this search, I found some how-to guides on Wikipedia and Commons; the links to these pages are at the bottom of the page. After defining the parameters to assess the quality of data visualization on Wikipedia, I evaluated the quality of the graph images (only the ones uploaded to Commons) that I retrieved from the 100 most popular pages on Wikipedia. Four of the seven qualitative parameters were formulated following the rules of Cairo and the visual variables of Bertin and Roth. The other three were developed to follow the rules of Wikipedia. I describe each process in detail below.


Defining Data Visualization

The first step of the research was to define what kind of images I was going to analyze, that is, to define which types of images count as data visualizations. Three distinct definitions were taken into account:

Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading, and color.
Tufte, Edward R. 2001. The Visual Display Of Quantitative Information. Graphics Press USA.

Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively
Munzner, Tamara. 2014. Visualization Analysis & Design. 1st ed. A K Peters/CRC Press.

A Data Visualization is a display of data designed to enable analysis, exploration and discovery
Cairo, Alberto. 2016. The Truthful Art. Berkeley, CA: New Riders.

Previous Academic Research

  1. Search Scopus for papers that contain the words “DATA VISUALIZATION” AND “WIKIPEDIA”
  2. Visualize with Gephi a network of the keywords of each paper retrieved
  3. Visualize with Gephi a network of the references of each paper retrieved
  4. Extract case studies

The results of this part are discussed in the Findings section.

Approach (qualitative parameters)

To assess the quality of the data visualizations retrieved from the 100 most popular pages on Wikipedia, I used 7 parameters:

  1. The sources of the picture are authoritative.
  2. The sources are easily accessible.
  3. The file is in an editable format (SVG), so that everybody can easily edit and distribute the content.
  4. The visualization is not misleading: it does not hide data (e.g. by manipulating the axis).
  5. The visual model is appropriate for the data represented.
  6. The aesthetic of the graph is simple and clear and does not prevent users from understanding the data depicted.
  7. The graph is insightful, which means that the visualization “clears the path to making valuable discoveries that would be inaccessible if the information were presented in a different way” (Cairo 2016).

Findings

Bibliography by Keywords

Network graph visualization showing papers and keywords retrieved from Scopus

The dataset was obtained by downloading from Scopus a list of the articles with their keywords. The keywords “Data visualization”, “visualization” and “Wikipedia” were omitted from the dataset. Afterward, Gephi's modularity measure divided the network into 4 clusters.

  1. The first cluster contains studies that visualize the collaborative process on Wikipedia
  2. The second cluster contains papers about collaborative data visualization on Wikipedia
  3. The third cluster contains papers about DBpedia
  4. The fourth and last cluster contains papers about the creation of Wikipedia articles
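
The dataset preparation described at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: the function names and the sample records are hypothetical, and the real input would be the keyword list exported from Scopus. The output is a two-column Source,Target edge table, a layout Gephi can import as an edge list.

```python
import csv
import io

# Search terms dropped from the keyword list, as described above.
OMITTED = {"data visualization", "visualization", "wikipedia"}


def keyword_edges(papers, omitted=OMITTED):
    """Return (paper, keyword) pairs, skipping the omitted search terms."""
    edges = []
    for title, keywords in papers.items():
        for keyword in keywords:
            normalized = keyword.strip().lower()
            if normalized and normalized not in omitted:
                edges.append((title, normalized))
    return edges


def edges_to_csv(edges):
    """Serialize the edges as a Source,Target table for import into Gephi."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


# Made-up records for illustration only; not taken from the Scopus export.
sample = {
    "Paper A": ["Wikipedia", "Collaboration", "Data visualization"],
    "Paper B": ["DBpedia", "collaboration"],
}
edges = keyword_edges(sample)
print(edges_to_csv(edges))
```

Lower-casing the keywords is a design choice of this sketch, so that spelling variants such as “Collaboration” and “collaboration” become the same node; the modularity step itself is left to Gephi.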

Bibliography by Reference

Network graph visualization showing papers and references for each paper retrieved from Scopus

The second Gephi network shows the references of each paper retrieved from the Scopus search. Using modularity, I highlighted two main clusters that show which papers share references:

  1. The first cluster is the biggest one and contains 15% of the elements of the graph. All of its papers used visualization to study Wikipedia. The papers contained in this cluster are “Visualizing recent changes in Wikipedia” by Biuk-Aghai et al. (2013) and “Visualize large scale human collaboration in Wikipedia” by Biuk-Aghai, Pang, Si (2013). The references they have in common are “Studying Cooperation and Conflict between Authors with history flow Visualizations” by Viégas, Wattenberg, Dave (2004) and “Talk Before You Type: Coordination in Wikipedia” by Viégas et al. (2007).
  2. The second cluster contains the papers of the Cartograph project, such as “Cartograph: Unlocking Thematic Cartography Through Semantic Enhancement” by Jackson, Hecht (2017).

“Visualizing activity on Wikipedia with Chromograms” by Wattenberg, Viégas, Hollenbach (2007), “Vispedia: On-demand data integration for interactive visualization and exploration” (2012) and “Vispedia: an interactive visual exploration of Wikipedia data via search-based integration” (2008), both by Chan et al., were retrieved by the search and are also cited as references by other retrieved papers.
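
The idea of papers sharing references can be made concrete with a small sketch. It is not the procedure used for the Gephi network (which relied on modularity); it only lists, for each pair of papers, the references both cite. The function name and the placeholder data are assumptions made for illustration.

```python
from itertools import combinations


def shared_references(references_by_paper):
    """Map each pair of papers to the set of references that both cite."""
    shared = {}
    papers = sorted(references_by_paper.items())
    for (paper_a, refs_a), (paper_b, refs_b) in combinations(papers, 2):
        common = set(refs_a) & set(refs_b)
        if common:
            shared[(paper_a, paper_b)] = common
    return shared


# Placeholder titles for illustration only; not the retrieved papers.
sample = {
    "Paper A": {"Ref 1", "Ref 2"},
    "Paper B": {"Ref 2", "Ref 3"},
    "Paper C": {"Ref 4"},
}
pairs = shared_references(sample)
print(pairs)
```

Pairs with many references in common are the ones a clustering step would tend to group together.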

Qualitative Analysis

Heat map showing the results of a qualitative analysis of 199 data visualizations retrieved from the top 100 most viewed pages on Wikipedia. Each square represents an image

This visualization shows the result of the analysis of the 199 data visualizations found in the 100 most popular pages of the English version of Wikipedia. Each square is an image, and the visualization shows the score of each image for each parameter.

Heat map showing the results of a qualitative analysis of 199 data visualizations retrieved from the top 100 most viewed pages on Wikipedia. Each line is a parameter

The first three parameters (the sources of the picture are authoritative, the sources are easily accessible, the file is in an editable format) have the lowest scores. The other four parameters, which define the quality of a good data visualization, scored well on average. This might be due to two main reasons:

  1. There are no clear, well-known guidelines on Commons about data visualization.
  2. Making a data visualization requires some technical knowledge, so the people who upload these images probably have some level of expertise.
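
A comparison of parameters like the one above can be sketched as a per-parameter average over the scored images. The scoring scale is not stated on this page, so the sketch assumes a hypothetical scale of 0 (not met) and 1 (met); the parameter labels are shortened and the two sample rows are invented for illustration, not taken from the 199 images.

```python
from statistics import mean

# Shortened labels for the seven qualitative parameters, in their listed order.
PARAMETERS = [
    "authoritative sources",
    "accessible sources",
    "editable format",
    "not misleading",
    "appropriate visual model",
    "clear aesthetic",
    "insightful",
]


def mean_score_per_parameter(scores):
    """Average each parameter over all images.

    `scores` holds one row per image with one score per parameter.
    """
    columns = zip(*scores)
    return {name: mean(column) for name, column in zip(PARAMETERS, columns)}


# Invented rows on an assumed 0/1 scale; not the real analysis data.
sample_scores = [
    [0, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 1, 0],
]
averages = mean_score_per_parameter(sample_scores)
# List the parameters from lowest to highest average score.
for name in sorted(averages, key=averages.get):
    print(name, averages[name])
```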


Quantitative Analysis

Scatterplot of the 199 data visualization images retrieved from the top 100 most viewed pages on Wikipedia

In “Wikipedia as Participatory Journalism: Reliable Sources? Metrics for Evaluating Collaborative Media as a News Source”, Lih described two quantitative metrics for evaluating the quality of a Wikipedia article: diversity, the number of unique editors, and rigor, the total number of edits to an article. Diversity and rigor were adapted to understand whether these parameters can be used to assess the quality of data visualizations on Wikipedia and Commons. Here, diversity is the number of unique editors who worked on a data visualization, and rigor is the number of times the image was uploaded (versioning). These data were gathered from the version history of each image on Commons. Finally, the results were compared with the results of the qualitative analysis.
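
The two adapted metrics can be written down in a few lines. This is a minimal sketch that assumes the version history of an image is already available as a list of upload records; the `Upload` type, the user names and the timestamps are hypothetical. Such records could, for instance, be requested from the MediaWiki API on Commons through the `imageinfo` property, but that step is an assumption and is not shown here.

```python
from collections import namedtuple

# One record per uploaded version of an image (hypothetical structure).
Upload = namedtuple("Upload", ["user", "timestamp"])


def diversity(uploads):
    """Number of unique users who uploaded a version of the image."""
    return len({upload.user for upload in uploads})


def rigor(uploads):
    """Total number of uploaded versions of the image."""
    return len(uploads)


# Invented history for illustration only.
history = [
    Upload("User A", "2016-11-08T20:00:00Z"),
    Upload("User A", "2016-11-09T02:00:00Z"),
    Upload("User B", "2016-11-10T09:30:00Z"),
]
print(diversity(history), rigor(history))
```

An image uploaded once and never revised has diversity 1 and rigor 1, which is the baseline against which the other images can be compared.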

Data visualization and Users

Network graph visualization showing data visualization images and users retrieved from the most viewed pages on Wikipedia (English)

The visualization shows the 199 images gathered from the top 100 most viewed pages on Wikipedia and the users who worked on them. The network shows that many images were uploaded once and never modified again by other users. Moreover, the 2016 and 2012 electoral vote results are connected by one user who uploaded them several times. This is because these images keep a live track of the election results and have to be re-uploaded every time there are new results. A single user worked on most of the data visualizations for the “climate change” page.
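
The two patterns described above (images uploaded only once, and users who connect several images) can be read directly from a list of uploads. The following sketch assumes one (image, user) pair per uploaded version; the function names and the sample data are hypothetical and do not reproduce the actual network.

```python
from collections import Counter, defaultdict


def single_upload_images(uploads):
    """Images that have exactly one uploaded version."""
    versions = Counter(image for image, _user in uploads)
    return sorted(image for image, count in versions.items() if count == 1)


def users_linking_images(uploads):
    """Users who uploaded versions of more than one image."""
    images_by_user = defaultdict(set)
    for image, user in uploads:
        images_by_user[user].add(image)
    return {
        user: images
        for user, images in images_by_user.items()
        if len(images) > 1
    }


# Invented uploads for illustration only.
sample_uploads = [
    ("Image 1", "User A"),
    ("Image 2", "User A"),
    ("Image 2", "User A"),
    ("Image 3", "User B"),
]
print(single_upload_images(sample_uploads))
print(users_linking_images(sample_uploads))
```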

Wikipedia and Commons How-to guides

Conclusions

References

  1. Cairo, Alberto. (2016). The Truthful Art: Data, Charts, and Maps for Communication. New Riders.
  2. Lih, A. (2004). Wikipedia as Participatory Journalism: Reliable Sources? Metrics for Evaluating Collaborative Media as a News Source. 5th International Symposium on Online Journalism, University of Texas at Austin.
  3. Roth, Robert. (2017). Visual Variables. 10.1002/9781118786352.wbieg0761.