Research:GCI Wiki Study/2018


As one of the many editors of the English Wikipedia, I keep wondering about various wiki statistics, so I turned those questions into a small formal study. You can help me by answering them!

Instructions

Information

Sample graph, here showing the average ratio of thanks received to edit count, by editor percentile.
  • Basic to intermediate SQL skills are needed to perform these tasks.
  • The goal is to produce, for each question, an SQL query and its result as a table, and then to present the data visually as a graph.
  • We will work on the English Wikipedia (whose database is enwiki_p), so it will be easy for anyone interested to run these queries on other wikis.

Tools used

"Replicas" is the common name for the SQL read-only database servers that allow everybody to run queries against publicly available Wikis datas. For example, all page edits are available in the revision column.

Quarry is a simple web interface that can be used to execute queries on the replicas. The data is returned in tables and can be exported in various formats. Don't hesitate to use the links in the site header; they explain how to navigate the replicas.

General instructions

Here are step-by-step instructions. Please follow them carefully.

  1. Claim one of the available tasks on the GCI website.
  2. Try to understand the question: imagine what you'll need to look at (revisions? log entries?) and what filters will be needed (by date? by user?).
  3. Create a Wikipedia/Wikimedia account if you don't have one yet.
  4. Log in to the Quarry website and fork this query sample (the Fork button is to the right of the title).
  5. Try to understand what this query sample does, then edit and run it until you get the expected result. That's the most important part of the work.
  6. Make sure that the data in the query result will let you answer the question, in words or with a graph.
  7. Once you have the data you want, export it from Quarry as CSV and open it in your favorite spreadsheet app. You can use Excel, LibreOffice, or even Google Sheets.
  8. Create a graph that visually answers the question. Pay attention to details such as correct legends, a comprehensible title, and attractive colors. You are free to create one that speaks to you.
  9. Then write a few sentences in English that explain what the table and the graph show, how to read them, and what should be known to understand them correctly.
  10. Export the graph in SVG or PNG format, then upload it to Wikimedia Commons, the Wikimedia media file repository, with an appropriate title and description. Please add the category GCI Wiki Study 2018 charts when asked. If you can't export it properly, as a last resort you can take a screenshot of your screen showing the graph and crop the file with appropriate software (Paint will do).
  11. Finally, create a new section on this page (button at the top right) with the question, the SQL query, the table, the graph, and your description, if possible in this order (see the example sketch after this list).
    To properly render your SQL query, use the <source lang="sql">your query here</source> syntax.
    To insert the table, Quarry can export it in the right format for you: copy and paste the wikicode export into the wiki page.
    To insert an image (the chart uploaded in the previous step), have a look at the documentation.
  12. Finished! Before submitting the task for review on the GCI site, please verify that you've followed all the instructions.
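
As an illustration for step 11, the wikicode of a new results section might be structured roughly like this (a sketch only; the section title, file name, and result are placeholders):

== Your question here ==
<source lang="sql">
your query here
</source>
{| class="wikitable"
! Result
|-
| your number here
|}
[[File:Your chart.svg|thumb|A short caption describing the chart]]
A few sentences explaining what the table and the graph show and how to read them.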

Throughout this process, don't hesitate to ask for help on the GCI site. You can also chat with me on IRC by contacting Framawiki on #wikimedia-tech. If I don't answer within a few minutes, it's because I'm not at my computer, so feel free to ask your question in the channel; other mentors will probably be around to help you.

Results

Here are the results for the given questions. Each section corresponds to a task on the GCI site and therefore to the work of one student. To add your own, click "Add section" at the top right of this page.

Percentage of users that use Javascript gadgets and CSS snippets in their pages


As you can see on the chart, only 2.3 % of user pages use gadgets, and an even smaller share use CSS snippets: only 1.1 %.

To get this data I used two similar SQL queries:

For javascript gadgets:

use enwiki_p;
select COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1))/(select COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1))
                                                            from page
                                                            where page_namespace=2) * 100 as Percentage
from page
where page_namespace=2 and page_content_model="javascript"

And for the CSS ones:

use enwiki_p;
select COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1))/(select COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1))
                                                            from page
                                                            where page_namespace=2) * 100 as Percentage
from page
where page_namespace=2 and page_content_model="css"
Percentage of user pages (%)
Javascript gadgets 1.1596
CSS snippets 2.3108
Users without any custom Javascript and styles 96.5296
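
For reference, the two queries above could also be combined into a single one that returns both percentages at once (a possible variation, not part of the original study):

use enwiki_p;
-- Group user subpages by content model and compute each model's share of users.
select page_content_model,
       COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1)) /
       (select COUNT(DISTINCT SUBSTRING_INDEX(page_title, '/', 1))
        from page
        where page_namespace = 2) * 100 as Percentage
from page
where page_namespace = 2
  and page_content_model in ('javascript', 'css')
group by page_content_model;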

Number of articles that redirect to a page that is not an article itself

Count of articles that redirect to a non-article page
28,438

As you can see in the table, there are 28,438 articles on the English Wikipedia that redirect to a page that is not an article (i.e. a page outside namespace 0).

To get this data, I used the following query:

use enwiki_p;

SELECT COUNT(DISTINCT rd_from) FROM redirect INNER JOIN page ON redirect.rd_from = page.page_id AND redirect.rd_namespace != 0 AND page.page_namespace = 0;
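
To see some concrete examples rather than only the count, a similar query can list a few of these redirects together with their target namespace and title (a sketch, not part of the original task):

use enwiki_p;

-- List ten articles that redirect outside the article namespace, together with their targets.
SELECT page.page_title, redirect.rd_namespace, redirect.rd_title
FROM redirect
INNER JOIN page ON redirect.rd_from = page.page_id
WHERE redirect.rd_namespace != 0 AND page.page_namespace = 0
LIMIT 10;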

Redundant deletion reasons used on the weekends


As you can see from the chart above, the top 5 most redundant deletion reasons represent only around 9.35% of the 1,086,516 deletion log comments recorded on weekends (Friday and Saturday).

Here are the top 5 most redundant deletion reasons used on weekends:

To get this data, I used the query:

use enwiki_p;

SELECT comment.comment_text, COUNT(comment.comment_text) as Cnt FROM logging 
INNER JOIN comment ON logging.log_comment_id = comment.comment_id
WHERE logging.log_type = 'delete' AND logging.log_action = 'delete' AND logging.log_namespace = 0 AND (WEEKDAY(logging.log_timestamp) = 4 OR WEEKDAY(logging.log_timestamp) = 5)
GROUP BY logging.log_comment_id
ORDER BY Cnt DESC
LIMIT 5;

And to calculate the total number of deletion reasons:

use enwiki_p;

SELECT COUNT(comment.comment_text) FROM comment 
INNER JOIN logging ON comment.comment_id = logging.log_comment_id
WHERE logging.log_type = 'delete' AND logging.log_action = 'delete' AND logging.log_namespace = 0 AND (WEEKDAY(logging.log_timestamp) = 4 OR WEEKDAY(logging.log_timestamp) = 5);
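
The 9.35% figure is the ratio between those two results; it could also be computed directly in one query, for example like this (a sketch using the same filters as above):

use enwiki_p;

-- Share of weekend deletion comments covered by the five most repeated reasons.
SELECT SUM(top5.Cnt) /
       (SELECT COUNT(comment.comment_text) FROM comment
        INNER JOIN logging ON comment.comment_id = logging.log_comment_id
        WHERE logging.log_type = 'delete' AND logging.log_action = 'delete' AND logging.log_namespace = 0
          AND (WEEKDAY(logging.log_timestamp) = 4 OR WEEKDAY(logging.log_timestamp) = 5)) * 100 AS TopFiveSharePercent
FROM (
  SELECT COUNT(comment.comment_text) AS Cnt FROM logging
  INNER JOIN comment ON logging.log_comment_id = comment.comment_id
  WHERE logging.log_type = 'delete' AND logging.log_action = 'delete' AND logging.log_namespace = 0
    AND (WEEKDAY(logging.log_timestamp) = 4 OR WEEKDAY(logging.log_timestamp) = 5)
  GROUP BY logging.log_comment_id
  ORDER BY Cnt DESC
  LIMIT 5
) AS top5;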

Categorized user gadgets

Count
Categorized user JavaScript gadgets 1,291

As you can see from the table, there are exactly 1,291 unique categorized user JS gadgets.

To get this data, I used the query:

use enwiki_p;

SELECT COUNT(DISTINCT page_title) FROM page
INNER JOIN categorylinks ON page.page_id = categorylinks.cl_from AND categorylinks.cl_to IS NOT NULL
WHERE page.page_content_model = 'javascript' AND page.page_namespace = 2;

Examples of categorized JS gadgets:

JS gadget
User:-JP-/monobook.js
User:0mnipotent/monobook.js
User:102orion/monobook.js
User:1234r00t/Sandboxlink.js
User:4433ismail/cologneblue.js
User:4433ismail/common.js
User:5dsddddd/monobook.js
User:7/namespace_redirect.js
User:7/userhighlighter.js
User:7107delicious/statusChanger2.js

To get a full list of categorized JS user gadgets, you can use this query:

use enwiki_p;

SELECT DISTINCT CONCAT("User:", page_title) FROM page
INNER JOIN categorylinks ON page.page_id = categorylinks.cl_from AND categorylinks.cl_to IS NOT NULL
WHERE page.page_content_model = 'javascript' AND page.page_namespace = 2;
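
To also see which categories these pages belong to, the same query could return the category name as well (a sketch, not part of the original task):

use enwiki_p;

-- List categorized user JS pages together with the category each one belongs to.
SELECT DISTINCT CONCAT("User:", page_title) AS js_page, categorylinks.cl_to AS category
FROM page
INNER JOIN categorylinks ON page.page_id = categorylinks.cl_from
WHERE page.page_content_model = 'javascript' AND page.page_namespace = 2;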

Articles created by bots

As you can see from the graph and the table, there are 16 bots in the French Wikipedia that have created articles, for a total of 45,619 articles that are not redirects. The top 2 bots in the table account for a little over 85% of all articles created by bots.

Count
Articles created by bots 45,619

To get this data, I used two queries.

To get the number of articles created by each bot:

use frwiki_p;

SELECT revision.rev_user_text, COUNT(revision.rev_user) as Cnt FROM revision
INNER JOIN user_groups ON revision.rev_user = user_groups.ug_user 
INNER JOIN page ON revision.rev_page = page.page_id
WHERE user_groups.ug_group = 'bot' AND revision.rev_parent_id = 0 AND page.page_namespace = 0 AND page.page_is_redirect = 0
GROUP BY revision.rev_user_text
ORDER BY Cnt desc;

To get the total number of articles created by bots:

use frwiki_p;

SELECT COUNT(revision.rev_user) as Cnt FROM revision
INNER JOIN user_groups ON revision.rev_user = user_groups.ug_user 
INNER JOIN page ON revision.rev_page = page.page_id
WHERE user_groups.ug_group = 'bot' AND revision.rev_parent_id = 0 AND page.page_namespace = 0 AND page.page_is_redirect = 0;
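
The share of each bot (used for the 85% figure above) could also be computed directly, for example like this (a sketch based on the two queries above):

use frwiki_p;

-- Percentage of all bot-created, non-redirect articles attributable to each bot.
SELECT revision.rev_user_text,
       COUNT(*) / (SELECT COUNT(revision.rev_user) FROM revision
                   INNER JOIN user_groups ON revision.rev_user = user_groups.ug_user
                   INNER JOIN page ON revision.rev_page = page.page_id
                   WHERE user_groups.ug_group = 'bot' AND revision.rev_parent_id = 0
                     AND page.page_namespace = 0 AND page.page_is_redirect = 0) * 100 AS SharePercent
FROM revision
INNER JOIN user_groups ON revision.rev_user = user_groups.ug_user
INNER JOIN page ON revision.rev_page = page.page_id
WHERE user_groups.ug_group = 'bot' AND revision.rev_parent_id = 0
  AND page.page_namespace = 0 AND page.page_is_redirect = 0
GROUP BY revision.rev_user_text
ORDER BY SharePercent DESC;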

Biggest articles created by anonymous users

The following table shows the largest articles by size that were created by anonymous users on the French Wikipedia:

As you can see from the above table, the biggest article created by an anonymous user is "Que sais-je ?", with a length of 496,079 bytes. "Que sais-je ?" is a series of books published by the Presses universitaires de France (PUF) with the aim of introducing the layman to many different fields. The series has many publications and the article lists them, which is one of the reasons the article is so large.

To get this data, I used the following query:

use frwiki_p;

SELECT page.page_title, page.page_len as Len FROM revision 
INNER JOIN page ON revision.rev_page = page.page_id
WHERE revision.rev_parent_id = 0 AND revision.rev_user = 0 AND page.page_namespace = 0 
GROUP BY page.page_title
ORDER BY Len DESC
LIMIT 100;


Top creators this year

As you can see from the table above, these are the top 10 article creators in the English Wikipedia in 2018. Qbugbot is at #1 with 18,382 created articles; this user creates articles about bug species.

To get this data, I used the following query:

use enwiki_p;

SELECT rev_user_text, COUNT(rev_user_text) as Cnt FROM revision
INNER JOIN page ON revision.rev_page = page.page_id
WHERE YEAR(revision.rev_timestamp) = 2018 AND revision.rev_parent_id = 0 AND page.page_namespace = 0 AND page.page_is_redirect = 0
GROUP BY revision.rev_user_text
ORDER BY Cnt DESC
LIMIT 10;
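
Since the #1 creator is a bot, one might also want to look at human editors only. As a sketch (not part of the original task), the query could be restricted to accounts that are not currently in the bot group; note that this does not catch accounts whose bot flag was later removed:

use enwiki_p;

-- Top 10 article creators in 2018, excluding accounts currently in the bot group.
SELECT rev_user_text, COUNT(rev_user_text) as Cnt FROM revision
INNER JOIN page ON revision.rev_page = page.page_id
LEFT JOIN user_groups ON revision.rev_user = user_groups.ug_user AND user_groups.ug_group = 'bot'
WHERE YEAR(revision.rev_timestamp) = 2018 AND revision.rev_parent_id = 0
  AND page.page_namespace = 0 AND page.page_is_redirect = 0
  AND user_groups.ug_user IS NULL
GROUP BY revision.rev_user_text
ORDER BY Cnt DESC
LIMIT 10;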