Jump to content

Research talk:Newcomer task suggestions/Work log/2014-09-30

Add topic
From Meta, a Wikimedia project coordination wiki

Tuesday, September 30th

[edit]

Today, I'm generating data to address the following question:

How do the types of edits made in each condition differ?

  • Compare proportion of main namespace edits between conditions.
  • Compare the rate of section edits between conditions (parse comment).
  • Compare the # of bytes changed between conditions (rev_len - parent.rev_len).
  • Compare the length of pages at at the time edit between conditions (parent.rev_len).

Here are the queries I came up with:

experimental_user_revision
SELECT
    DATABASE() AS wiki,
    user_id,
    revision.rev_id,
    revision.rev_comment,
    page_namespace,
    CAST(revision.rev_len AS INT) -
        CAST(IFNULL(parent.rev_len, 0) AS INT) AS bytes_changed,
    IFNULL(parent.rev_len, 0) AS previous_bytes
FROM staging.tr_experimental_user
INNER JOIN revision FORCE INDEX (user_timestamp) ON
    user_id = rev_user AND
    revision.rev_timestamp BETWEEN
        user_registration AND
        DATE_FORMAT(DATE_ADD(user_registration, INTERVAL 7 DAY),
                    "%Y%m%d%H%i%S")
LEFT JOIN revision parent ON
    parent.rev_id = revision.rev_parent_id
INNER JOIN page ON
    revision.rev_page = page_id
WHERE
    wiki = DATABASE();
experimental_user_edit_stats
SELECT
    wiki,
    user_id,
    SUM(rev_id IS NOT NULL) AS revisions_saved,
    SUM(page_namespace = 0) AS mainspace_edits,
    SUM(rev_comment LIKE "/* % */%") AS main_section_edits,
    SUM(IF(page_namespace = 0 AND bytes_changed > 0, bytes_changed, 0)) AS
            main_bytes_added ,
    SUM(IF(page_namespace = 0 AND bytes_changed < 0, bytes_changed, 0)) AS
            main_bytes_removed,
    SUM(bytes_changed) AS main_bytes_changed,
    AVG(IF(page_namespace = 0, previous_bytes, NULL)) AS
            mean_mainspace_previous_bytes
FROM staging.tr_experimental_user
INNER JOIN staging.tr_experimental_user_revision USING (wiki, user_id)
GROUP BY wiki, user_id;

The first query was hell to optimize. Suffice it to say that MySQL needed to be coerced into using an index through the removal of another. See [1] if you are curious.

Now, I think that I'll actually try to generate the per-user stats in R since I only have about 50k rows to deal with. --Halfak (WMF) (talk) 22:09, 30 September 2014 (UTC)Reply


So there's a problem with comparing proportions between users. Users who made many edits have higher weight than editors who have made few edits. To control for this, I randomly sampled one edit per user to generate stats from. Given the large amount of observations, this shouldn't be a problem.

Compare proportion of main namespace edits between conditions

[edit]
The proportion of sampled edits to articles (mainspace) is plotted by experimental condition and wiki.
Mainspace proportion. The proportion of sampled edits to articles (mainspace) is plotted by experimental condition and wiki.
The proportion of sampled user talk page edits is plotted by experimental condition and wiki.
Talk proportion. The proportion of sampled user talk page edits is plotted by experimental condition and wiki.

Except for the low numbers of observations in small wikis, there's no consistent trends here.

Compare the rate of section edits between conditions (parse comment)

[edit]
The proportion of sampled article edits with section comments is plotted by experimental condition and wiki.
Section comment proportion. The proportion of sampled article edits with section comments is plotted by experimental condition and wiki.

No consistent trends here either.

Compare the # of bytes changed between conditions (rev_len - parent.rev_len)

[edit]
The density of bytes changed per sampled user article edit is plotted by experimental condition and wiki.
Bytes changed (density). The density of bytes changed per sampled user article edit is plotted by experimental condition and wiki.
The gemetric mean # of bytes changed per sampled user article edit is plotted by experimental condition and wiki.
Bytes changed. The gemetric mean # of bytes changed per sampled user article edit is plotted by experimental condition and wiki.

Again, no consistent trends. The only differences appear in the cases where there are vanishingly few observations.


Compare the length of pages at at the time edit between conditions (parent.rev_len)

[edit]
The density of previous revision bytes per sampled user article edit is plotted by experimental condition and wiki.
Previous bytes (density). The density of previous revision bytes per sampled user article edit is plotted by experimental condition and wiki.
The gemetric mean # of bytes in the previous article revision is plotted by experimental condition and wiki.
Previous bytes. The gemetric mean # of bytes in the previous article revision is plotted by experimental condition and wiki.

Still no trend. It looks like, regardless of the intervention, newcomers edit the same proportion of mainspace articles. They make the same proportion of section edits. They change about the same amount of content per edit. And they edit the same size pages. We're here again with no effect.  :\ --Halfak (WMF) (talk) 22:28, 30 September 2014 (UTC)Reply