Research talk:Quality dynamics of English Wikipedia/Work log/2017-02-17

From Meta, a Wikimedia project coordination wiki

Friday, February 17, 2017[edit]

Hey folks. Today I'm going to generate some monthly datasets for a few WikiProjects

  • WikiProject Chicago
  • WikiProject Michigan
  • WikiProject Oregon

So, my past queries assumed that only one template was used. This isn't true. There are many redirected template names that can be included on talk pages. For example, Military history has a lot.

> select page.page_title from pagelinks INNER JOIN page ON page_id = pl_from WHERE pl_namespace = 10 and pl_title = "WikiProject_Military_history" AND pl_from_namespace = 10 AND page.page_is_redirect;
+------------------------------+
| page_title                   |
+------------------------------+
| WikiProject_Military_History |
| WPCAS                        |
| WikiProject_Colditz          |
| MILHIST                      |
| Milhist                      |
| WP_Military_History          |
| WikiProject_MILHIST          |
| WikiProject_War              |
| WP_Military_history          |
| WPMILHIST                    |
| WPMilhist                    |
| MilHist                      |
| Mil_Hist                     |
| WPMH                         |
| WP_Mil                       |
| Military_history             |
| WPMIL                        |
+------------------------------+
17 rows in set (0.00 sec)

So in order to account for that, I'm going to change the querying strategy to pull in all of these templates too.

SELECT
  monthly_aq.timestamp AS month,
  SUM(weighted_sum) AS weighted_sum,
  SUM(LOG(weighted_sum)) AS weighted_log_sum,
  SUM(prediction = "Stub") AS stub_n,
  SUM(prediction = "Start") AS start_n,
  SUM(prediction = "C") AS c_n,
  SUM(prediction = "B") AS b_n,
  SUM(prediction = "GA") AS ga_n,
  SUM(prediction = "FA") AS fa_n,
  COUNT(*) AS n
FROM page AS talk
INNER JOIN page AS article ON
  talk.page_title = article.page_title AND
  article.page_namespace = 0
INNER JOIN templatelinks ON
  tl_from = talk.page_id
INNER JOIN staging.enwiki_monthly_aq AS monthly_aq ON
  article.page_id = monthly_aq.page_id
WHERE
  talk.page_namespace = 1 AND
  tl_namespace = 10 AND
  (
    tl_title = @WP_TEMPLATE OR 
    tl_title IN (
      SELECT page.page_title 
      FROM pagelinks 
      INNER JOIN page ON page_id = pl_from 
      WHERE 
        pl_namespace = 10 AND 
        pl_title = @WP_TEMPLATE AND 
        pl_from_namespace = 10 AND 
        page.page_is_redirect
    )
  )
GROUP BY month;

The hard part is that I need to manually do a bit of work to find the right template for a specific WikiProject. My strategy right now is just to go to a page that would *obviously* be classified by the target WikiProject, find out what template is used and check to see if it redirects me to a "main" template and then use that.

E.g.

Alright. Now I'm generating some datasets. --EpochFail (talk) 16:48, 17 February 2017 (UTC)[reply]