User:EpochFail/Journal/2012-03-01

From Meta, a Wikimedia project coordination wiki

Thursday, March 1st[edit]

I want to look at two things tonight.

  1. I need to go some inter-rater reliability metrics for Fabric vs. Wikipedians
  2. I want to learn something about the feedback left on semi-protected articles
    • I'm not sure what I can do with it since I don't think it is worth our volunteers time to hand code it. My plan is to see how I can gather it and to look at a little bit of it by hand.

First with Fabrice's codings. My plan is to:

  1. Gather all codings by coder
  2. Separate codings by other raters and find "strict" and "everyone" measures of utility
  3. Join with fabrice's codings to form a 2 column dataset that is Fabrice vs. Wikipedia
  4. Use Cohen's Kappa to determine reliability.


 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 225 
   Raters = 2 
    Kappa = 0.699 

        z = 10.7 
  p-value = 0 

OK. I'm exhausted. I'll have to look at semi-protected articles this weekend.

03:26, 2 March 2012 (UTC)

Monday, March 5th[edit]

First question: How do I know when a page is semi-protected?


Ahh... There appears to be a page_restriction table. Let's explore...

mysql> select distinct pr_type from page_restrictions;
+---------+
| pr_type |
+---------+
| edit    |
| move    |
| upload  |
+---------+
3 rows in set (0.00 sec)

I think that edit restrictions are what I want. There should be a limitation on the restrictions. Let's check the pr_level column.

mysql> select distinct pr_type, pr_level from page_restrictions;
+---------+---------------+
| pr_type | pr_level      |
+---------+---------------+
| edit    | autoconfirmed |
| edit    | sysop         |
| move    | autoconfirmed |
| move    | sysop         |
| upload  | autoconfirmed |
| upload  | sysop         | 
+---------+---------------+
6 rows in set (0.00 sec)

Wow... so there are pages that only sysops can edit. Which ones?

mysql> select page_namespace, page_title from page inner join page_restrictions ON page_id = pr_id and pr_type = "edit" and pr_level = "sysop" limit 10;
+----------------+-----------------------------------------------+
| page_namespace | page_title                                    |
+----------------+-----------------------------------------------+
|              0 | Legitimacy_of_NATO's_bombing_campaign         |
|              0 | East_Bloomfield,_New_York                     |
|              0 | Nassau_(village),_Rensselaer_County,_New_York |
|              0 | Bf-109                                        |
|              0 | Oussama_bin_Laden                             |
|              0 | Barbara_Hepworth                              |
|              0 | Alp                                           |
|              0 | Majlis                                        |
|              0 | Dangwai                                       |
|              0 | Sunny_Isles                                   |
+----------------+-----------------------------------------------+
10 rows in set (0.00 sec)

Well that is interesting. I was able to edit something that is restricted to sysops only [1].

Is my data out of date?

mysql> select max(rc_timestamp) from recentchanges;
+-------------------+
| max(rc_timestamp) |
+-------------------+
| 20120306002542    |
+-------------------+
1 row in set (0.00 sec)

Wat? Something might be weird. Maybe there are some SpecialPages that will help.

Wait... aha! I was joining on the wrong column.


mysql> select page_namespace, page_title from page inner join page_restrictions ON page_id = pr_page and pr_type = "edit" and pr_level = "sysop" limit 10;
+----------------+-------------------------------+
| page_namespace | page_title                    |
+----------------+-------------------------------+
|              0 | Beatles                       |
|              0 | Colour                        |
|              0 | Dr_Who                        |
|              0 | Floccinaucinihilipilification |
|              0 | Muslims                       |
|              0 | Palestinian_people            |
|              4 | Foldoc_license                |
|              0 | Roman_Catholic                |
|              0 | Sri_Lanka                     |
|              4 | Wikipedia_NEWS                |
+----------------+-------------------------------+
10 rows in set (0.01 sec)

Now these are actually protected. Using this table, I can only get the currently protected pages. Time to look into the log to see if I can get a history of protection per page.

mysql> select log_type, log_action, count(*) from logging where log_type = "protect" group by 1,2;
+----------+------------+----------+
| log_type | log_action | count(*) |
+----------+------------+----------+
| protect  | modify     |    32050 |
| protect  | move_prot  |     8934 |
| protect  | protect    |   267007 |
| protect  | unprotect  |    54961 |
+----------+------------+----------+
4 rows in set (2.86 sec)

Hmm.. What is "modify" or "move_prot"?


mysql> select log_comment, log_params from logging where log_type = "protect" and log_action = "modify" order by log_id desc limit 10;
+-----------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| log_comment                                                                                                     | log_params                                                                                                                       |
+-----------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| tweeking to admin only as socks are using alternate venues to try to get this restored - preempt a sleeper acct | ‎[create=sysop] (indefinite)
                                                                                                  |
| Violations of the [[WP:Biographies of living persons|biographies of living persons policy]]                     | ‎[edit=autoconfirmed] (expires 21:51, 12 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                       |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (expires 18:28, 12 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                       |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (expires 15:49, 19 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                       |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (expires 14:29, 12 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                       |
| upcoming TFA                                                                                                    | ‎[move=sysop] (expires 00:00, 9 March 2012 (UTC))
                                                                             |
| premature addition of yet unofficial election results                                                           | ‎[edit=autoconfirmed] (expires 00:23, 8 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                        |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (expires 23:26, 4 September 2012 (UTC)) ‎[move=autoconfirmed] (expires 23:26, 4 September 2012 (UTC))
 |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (indefinite) ‎[move=sysop] (indefinite)
                                                               |
| Persistent [[WP:Vandalism|vandalism]]                                                                           | ‎[edit=autoconfirmed] (expires 15:45, 11 March 2012 (UTC)) ‎[move=sysop] (indefinite)
                                       |
+-----------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.00 sec)

It looks like "modify" is an update to an existing protection. I'm worried about the "expires ..." since that could mean I have to parse that to understand what is going on. Hopefully I don't. Lets see if I can find an expiration that corresponds to an "unprotect".

mysql> select * from logging where log_id = 41332822;
+----------+----------+------------+----------------+----------+---------------+------------------------+----------------------------------------------------------------+-----------------------------------------------------------------------------+-------------+------------------+----------+
| log_id   | log_type | log_action | log_timestamp  | log_user | log_namespace | log_title              | log_comment                                                    | log_params                                                                  | log_deleted | log_user_text    | log_page |
+----------+----------+------------+----------------+----------+---------------+------------------------+----------------------------------------------------------------+-----------------------------------------------------------------------------+-------------+------------------+----------+
| 41332822 | protect  | modify     | 20120229050041 |  5532431 |             0 | 420_(cannabis_culture) | Persistent [[WP:Vandalism|vandalism]]: until the IP gets bored | [edit=sysop] (expires 05:00, 3 March 2012 (UTC)) [move=sysop] (indefinite)
 |           0 | Star Mississippi |   145891 |
+----------+----------+------------+----------------+----------+---------------+------------------------+----------------------------------------------------------------+-----------------------------------------------------------------------------+-------------+------------------+----------+

This one expired on March 3rd, so I should be able to find the unprotect if there is one.

mysql> select * from logging where log_type = "protect" and log_action = "unprotect" and log_namespace = 0 and log_title = "420_(cannabis_culture)";
+---------+----------+------------+----------------+----------+---------------+------------------------+-------------------------------------------------------------+------------+-------------+---------------+----------+
| log_id  | log_type | log_action | log_timestamp  | log_user | log_namespace | log_title              | log_comment                                                 | log_params | log_deleted | log_user_text | log_page |
+---------+----------+------------+----------------+----------+---------------+------------------------+-------------------------------------------------------------+------------+-------------+---------------+----------+
| 2487269 | protect  | unprotect  | 20060423035532 |   340280 |             0 | 420_(cannabis_culture) | removing sprotection, high visibility is no longer an issue |            |           0 |               |     NULL |
+---------+----------+------------+----------------+----------+---------------+------------------------+-------------------------------------------------------------+------------+-------------+---------------+----------+
1 row in set (0.02 sec)

Well this is wrong. :\ It looks like I can't trust the "protect", "unprotect" event to always be there. It looks like it has to be done manually.  :(

So... I'm going to have to write a script that parses these log_params to get what I need out of them.  :(


Tuesday, March 6th[edit]

Hypothesis:

  • "protect" "protect" event starts all protection events
  • "protect" "modify" event changes a current protection status
  • "protect" "unprotect" event represents a manual unprotection of a page
  • "protect" "move_prot" moves the protection settings from one page (log_param as "Namespace:PageName") to another page (log_namespace and log_title)
    • I'm not entirely sure this means anything since page moves do not change the page_id.


                  .--<---.
                  |      |
  [protect]--->---[modify]->-.
       |                     |   ,--->[unprotect]
       '--------->-----------'---|
                                 '--->(expiration)

Expiration comes from parsing the log_params field of the table. If the log_params field is not parseable, the protection should be considered to be "(indefinite)".

To process the protection history for a page, you have to process the entire history of the logging table for log_type = "protect" and log_action in "protect", "modify", "unprotect".

I propose a protection history table:

  • page_id - The identifier of the page
  • type - Type of protection (edit|move)
  • group - The user level required to circumvent protection (autoconfirmed|sysop)
  • start - The timestamp at which the protection begins
  • end - The timestamp at which the protection ends (if NULL, protection is currently in effect)


I'd run this query

SELECT log_id, log_action, log_timestamp, log_namespace, log_title, log_params
FROM logging
WHERE log_type = "protect"
AND   log_action IN ("protect", "unprotect", "modify")

Each protect/protect would create a new protection row with start set to log_timestamp. When a protect/modify is found, the current protection is ended and a new protection is started with the new configuration for page_id = log_page. When a protect/unprotect is received, all current protection is stopped for page_id = log_page.

Tomorrow, I'll set out to write the script that will handle this.

Wednesday, March 7th[edit]

I discovered a problem with tracking page protection: page moves. Log events for protection up until Oct. 2009 did not include the page Id. Cool, we'll just look that up in the page table, right? WRONG! The title could have changed since the protection event took place. So... I have to track page moves in order to determine the page_id at the time of the protection. This too is non-trivial since I'll have to track all of the moves for a page until today in order to determine what the pageId should be. This could mean that I have to keep *all* protection events in memory while I process them in order update all of the events without a pageId at the end. Damn. I'll have to consider the cost of solving this problem before I continue. 15:42, 7 March 2012 (UTC)


mysql> select log_page IS NOT NULL, count(*) from logging where log_type = "protect" and log_action = "protect" group by 1;
+----------------------+----------+
| log_page IS NOT NULL | count(*) |
+----------------------+----------+
|                    0 |   184840 |
|                    1 |    82211 |
+----------------------+----------+
2 rows in set (3 min 59.92 sec)

It looks like there are more protect events without a page_id than with one. However, tracking ~250k events should be manageable within memory.

16:05, 7 March 2012 (UTC)