Research:Wiki Ed student editor contributions to sciences on Wikipedia

From Meta, a Wikimedia project coordination wiki
21:28, 20 May 2016 (UTC)
Duration:  2016-05 – 2016-06
This page documents a completed research project.

2016 is the Year of Science for the Wiki Education Foundation. With this push to create new science content, how do we determine its real impact on Wikipedia? To find the portion of science content generated by Wiki Ed, we analyse the work performed on science articles by the Spring 2016 Wiki Ed cohort. We conclude that Wiki Ed science output varies substantially with the point in the academic year, and that at its peak students sustain roughly 6% of all science-related content added.

Research Question

RQ: What portion of science content is generated by Wiki Ed students?


Defining science articles

Our first challenge is to identify which articles count as science articles. We do this by first finding science WikiProjects and then finding the articles tagged by these projects.

We start with the WikiProject directory, selecting all projects listed under Science. This gives us a diverse list of 295 projects. We then review this list, removing non-top-level projects (e.g. task forces such as the Climate change task force) as well as projects that fall well outside a reasonable definition of science (e.g. Internet culture). Our final list of selected projects is given below.

Selected projects

  1. Abortion
  2. AIDS
  3. Aircraft
  4. Airlines
  5. Airports
  6. Algae
  7. Alternative_fuels
  8. Alternative_medicine
  9. Amiga
  10. Amphibians_and_Reptiles
  11. Anatomy
  12. Animal_anatomy
  13. Animals
  14. Apple_Inc.
  15. Aquarium_Fishes
  16. Aquatic_Invertebrates
  17. Archaeology
  18. Arthropods
  19. Astronomical_objects
  20. Astronomy
  21. Audiovisual_telecommunications
  22. Australian_biota
  23. Automobile_construction
  24. Aviation
  25. Banksia
  26. Beekeeping
  27. Beetles
  28. Bell_System
  29. Biology
  30. Biophysics
  31. Biota_of_Great_Britain_and_Ireland
  32. Birds
  33. Bivalves
  34. Blades
  35. C++
  36. Cannabis
  37. Carnivorous_plants
  38. Cats
  39. Cell_Signaling
  40. Cellular_devices
  41. Cephalopods
  42. Cetaceans
  43. Chemical_and_Bio_Engineering
  44. Chemicals
  45. Chemistry
  46. Civil_engineering
  47. Climate
  48. Cognitive_science
  49. Color
  50. Computational_Biology
  51. Computer_graphics
  52. Computer_music
  53. Computer_science
  54. Computer_Security
  55. Computer_Vision
  56. Computing
  57. Cosmology
  58. Cryptography
  59. Cryptozoology
  60. Dams
  61. Databases
  62. Dentistry
  63. Dinosaurs
  64. Dogs
  65. Dyslexia
  66. Earthquakes
  67. Earth_science
  68. Eclipses
  69. Ecology
  70. Economics
  71. Ecoregions
  72. Electrical_engineering
  73. Electronics
  74. Elements
  75. Energy
  76. Engineering
  77. Environment
  78. Equine
  79. Evolutionary_biology
  80. Explosives
  81. Extinction
  82. Firearms
  83. First_aid
  84. Fishes
  85. Forestry
  86. Formation_Evaluation
  87. Free_Software
  88. Fungi
  89. Futures_studies
  90. Game_theory
  91. Gastropods
  92. Gemology_and_Jewelry
  93. Gender_Studies
  94. Genetics
  95. Geologic_timescale
  96. Geology
  97. Gliding
  98. Golden_ratio
  99. Health_and_fitness
  100. History_of_Biology
  101. History_of_Nuclear_Enterprise
  102. History_of_Science
  103. Horticulture_and_Gardening
  104. Hospitals
  105. Human_Genetic_History
  106. Insects
  107. IRC
  108. Java
  109. KDE
  110. Lepidoptera
  111. LGBT_studies
  112. Linux
  113. Malware
  114. Mammals
  115. Mantodea
  116. Marine_life
  117. Mars
  118. Mass_spectrometry
  119. Mathematical_and_Computational_Biology
  120. Mathematics
  121. Mathematics_Competitions
  122. Measurement
  123. Medicine
  124. Metalworking
  125. Meteorology
  126. Method_engineering
  127. Microbiology
  128. Microsoft
  129. Microsoft_Windows
  130. Mind-Body
  131. Mining
  132. Molecular_and_Cellular_Biology
  133. Monotremes_and_Marsupials
  134. Moon
  135. National_Health_Service
  136. Nature
  137. .NET
  138. Neuroscience
  139. NIH
  140. NLP_concepts_and_methods
  141. Non-tropical_storms
  142. Numbers
  143. Nursing
  144. Optics
  145. Perl
  146. Pharmacology
  147. Phasmatodea
  148. Physical_Chemistry
  149. Physics
  150. Physiology
  151. Plan_9
  152. Plants
  153. Pollution
  154. Polyhedra
  155. Polymers
  156. Primates
  157. Probability
  158. Programming_languages
  159. Prokaryotes_and_protists
  160. Pseudoscience
  161. Psychedelics,_Dissociatives_and_Deliriants
  162. Psychology
  163. Pterosaurs
  164. Radio_Stations
  165. RISC_OS
  166. RNA
  167. Robotics
  168. Rocketry
  169. Rocks_and_minerals
  170. Rodents
  171. Sanitation
  172. Science
  173. Sea_Monsters
  174. Seamounts
  175. Severe_weather
  176. Sexology_and_sexuality
  177. Sharks
  178. Signal_Processing
  179. Software
  180. Soil
  181. Solar_System
  182. Spaceflight
  183. Spectroscopy
  184. Spiders
  185. Statistics
  186. Superfunds
  187. Systems
  188. Systems_Engineering_Initiative
  189. Technology
  190. Telecommunications
  191. Time
  192. Trains
  193. Trains_in_Japan
  194. Transportation
  195. Tree_of_Life
  196. Tropical_cyclones
  197. Uniform_Polytopes
  198. Veterinary_medicine
  199. Viruses
  200. Volcanoes
  201. Water
  202. Women_scientists
  203. Years_in_science
  204. Zoo

Collecting science revisions

After selecting the science WikiProjects, we still need to identify the pages belonging to the selected projects and the revisions belonging to those pages.

Selecting science pages

We use category labels to identify science pages. Using the replica databases provided at wmflabs, we select all category links that indicate a quality rating or importance; the names of these categories frequently contain the project name. The output of this query is then processed to extract the project name and the associated page id. From the resulting list of project and page-id pairs, we select all page ids that belong to the selected projects.
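The extraction step could look roughly like the sketch below. It assumes the assessment categories follow naming patterns such as `B-Class_Chemistry_articles` and `High-importance_Chemistry_articles`; the regular expression, the function names and the sample rows are illustrative assumptions, not the query or script used in this study.

```python
import re

# Assumed naming patterns for assessment categories (illustrative only):
#   "B-Class_Chemistry_articles", "High-importance_Chemistry_articles"
ASSESSMENT_CATEGORY = re.compile(
    r"^[A-Za-z]+-(?:Class|importance)_(?P<project>.+)_articles$"
)


def extract_project(category_title):
    """Return the project name embedded in a category title, or None."""
    match = ASSESSMENT_CATEGORY.match(category_title)
    return match.group("project") if match else None


def select_page_ids(category_links, selected_projects):
    """Collect the page ids tagged by any of the selected projects.

    category_links: iterable of (page_id, category_title) pairs, for
    example rows returned by a query over category links.
    """
    page_ids = set()
    for page_id, category_title in category_links:
        project = extract_project(category_title)
        if project in selected_projects:
            page_ids.add(page_id)
    return page_ids


# Hypothetical sample rows, not real data
rows = [
    (101, "B-Class_Chemistry_articles"),
    (101, "High-importance_Chemistry_articles"),
    (202, "Start-Class_Internet_culture_articles"),
    (303, "Living_people"),
]
print(select_page_ids(rows, {"Chemistry", "Physics"}))
```

Categories that do not match the assumed pattern are simply ignored, which is why the extraction is described as working "frequently" rather than always.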

Selecting science revisions

With our science pages identified, we iterate through all enwiki-20160601-stub-meta-history*.xml.gz dumps. For every revision belonging to an identified page, we compute the difference in bytes between that revision and the preceding revision of the same page.
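As a rough illustration of the diffing step, the sketch below works on plain tuples rather than on the dump files themselves. The tuple layout, the function name and the choice to compare a page's first revision against an empty page are assumptions made for the example.

```python
def byte_deltas(revisions, science_page_ids):
    """Yield (page_id, rev_id, delta) for revisions of the selected pages.

    revisions: iterable of (page_id, rev_id, byte_length) tuples, ordered
    chronologically within each page.
    Assumption: the first revision of a page is compared to an empty page.
    """
    previous_length = {}
    for page_id, rev_id, byte_length in revisions:
        if page_id not in science_page_ids:
            continue  # not one of the identified science pages
        delta = byte_length - previous_length.get(page_id, 0)
        previous_length[page_id] = byte_length
        yield page_id, rev_id, delta


# Hypothetical sample data: page 999 is not a science page
sample = [
    (101, 1, 500),
    (101, 2, 650),
    (999, 3, 40),
    (101, 4, 600),
]
deltas = list(byte_deltas(sample, {101}))
print(deltas)
```

Negative deltas are kept at this stage; only the later tally restricts itself to edits that add bytes.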

Tallying daily contributions

Using the selected revisions, we tally the number of positive bytes added by Wiki Ed students and by contributors in general. We iterate through our data set of selected revisions and, for each one, search back 10 revisions for a matching sha1, which indicates a likely revert. If we find a match, we omit all edits between the original and the reverting edit, and we also omit the reverting edit itself. For each edit that passes this test and contributes a positive number of bytes, we identify its date and contributor. If the contributor is part of the Wiki Ed cohort, the bytes are added to that date's sum of contributions for Wiki Ed.
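The revert filter and the daily tally described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the analysis: the dictionary keys, function names and sample data are hypothetical, the nearest matching sha1 within the window is treated as the restored revision, and student edits are counted in the general total as well as in the Wiki Ed total.

```python
from collections import defaultdict

REVERT_WINDOW = 10  # number of earlier revisions checked for a matching sha1


def flag_reverted(page_revisions, window=REVERT_WINDOW):
    """Return one boolean per revision: True if it should be omitted.

    page_revisions: chronologically ordered list of dicts for ONE page,
    each with at least a "sha1" key.
    """
    omitted = [False] * len(page_revisions)
    for i, rev in enumerate(page_revisions):
        start = max(0, i - window)
        # Assumption: the nearest earlier match is the restored revision.
        for j in range(i - 1, start - 1, -1):
            if page_revisions[j]["sha1"] == rev["sha1"]:
                # Omit the reverted edits and the reverting edit itself.
                for k in range(j + 1, i + 1):
                    omitted[k] = True
                break
    return omitted


def tally_daily(page_revisions, cohort):
    """Sum positive byte deltas per date for all users and for the cohort."""
    all_users = defaultdict(int)
    students = defaultdict(int)
    omitted = flag_reverted(page_revisions)
    for rev, skip in zip(page_revisions, omitted):
        if skip or rev["delta"] <= 0:
            continue
        all_users[rev["date"]] += rev["delta"]
        if rev["user"] in cohort:
            students[rev["date"]] += rev["delta"]
    return dict(all_users), dict(students)


# Hypothetical history of one page: the fourth edit restores the second.
history = [
    {"sha1": "a", "delta": 100, "date": "2016-04-01", "user": "Alice"},
    {"sha1": "b", "delta": 50, "date": "2016-04-01", "user": "Student1"},
    {"sha1": "c", "delta": 30, "date": "2016-04-02", "user": "Vandal"},
    {"sha1": "b", "delta": -30, "date": "2016-04-02", "user": "Alice"},
    {"sha1": "d", "delta": 20, "date": "2016-04-03", "user": "Student1"},
]
general, wiki_ed = tally_daily(history, cohort={"Student1"})
print(general)
print(wiki_ed)
```

In this sample the third and fourth edits are dropped, so nothing is credited to 2016-04-02 even though the reverted edit added bytes.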


Bytes added by general users to Wikipedia in science articles during spring 2016.

Contributions by general users are relatively consistent throughout the semester, with the exception of some intermittent spikes. The largest spike in science contribution, and in fact in all contribution in the history of Wikipedia, occurred on April 11 and coincided with User:Rfassbind merging lists of 100 minor planets into lists of 1000 minor planets. For the remaining analysis we remove contributions made on April 11, since the metric for that day is dominated by content being moved around rather than created, and content creation is the concept we are attempting to measure.

The median level of contribution over Wikipedia's entire lifespan is 1129336 bytes/day; over the past two years alone it is a slightly lower 1079130 bytes/day.

Bytes added by Wiki Ed students to Wikipedia in science articles during spring 2016.

Student bytes added to science articles show a clear relationship with the academic schedule. Students are more active in the last two months of the typical semester, which corresponds to the period when term papers are typically due. There are also smaller bursts of activity in mid-February and early March, which we believe are related to quarter-system classes.

During the spring 2016 semester, general editors produced 192878894 bytes (167649187 bytes when April 11 is removed). Over the same period, Wiki Ed students produced 4685882 bytes of content (4611449 bytes omitting April 11). With April 11 omitted, this amounts to 2.8% of all science content added. If we narrow our focus to the most active time in the school year, between April 1 and May 15 (April 11 omitted), Wiki Ed students contribute 5.9% of all science content added.
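The semester-wide share can be checked directly from the totals reported above. The short calculation below assumes that the general-editor total is the denominator, that is, that it already includes the student bytes; the variable names are illustrative.

```python
# Totals for spring 2016 with April 11 omitted, as reported above
general_bytes = 167_649_187
wiki_ed_bytes = 4_611_449

share = wiki_ed_bytes / general_bytes
print(f"{share:.1%}")  # prints 2.8%
```

Dividing the totals that include April 11 (4685882 by 192878894) gives about 2.4% instead, which shows how strongly that single day affects the result.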

Portion of bytes added by Wiki Ed students during spring 2016 to science articles.