Research talk:Teahouse long term new editor retention/Work log/2015-11-24

Tuesday, November 24, 2015[edit]

Today, I'm working with some datasets that show how links to the teahouse and subst templates were added to talk pages historically.

I've finished running the teahouse link extractor, but I found a couple of issues.

We're matching "wp:tea" which does not got to the teahouse, but rather en:Wikipedia:A nice cup of tea and a sit down. This is easily fixed with grep.
We're picking up the section titles (e.g. "Foo#Section bar") in the links. This is OK, but it means that we're not counting the link additions in the expected way. Once we get these records loaded into the database, I can use STRING_INDEX() to split off the "#" section info.

> CREATE TABLE th_link_additions_clean SELECT page_id, page_namespace, page_title, rev_id, rev_timestamp, rev_comment, SUBSTRING_INDEX(link, "#", 1) AS link, SUM(diff) AS diff FROM th_link_additions GROUP BY 4,7;
Query OK, 444683 rows affected (10.78 sec)
Records: 444683  Duplicates: 0  Warnings: 0
> alter table th_link_additions_clean rename to th_link_additions;
Query OK, 0 rows affected (0.01 sec)
> CREATE UNIQUE INDEX rev_link ON th_link_additions (rev_id, link);
Query OK, 444683 rows affected (2.54 sec)
Records: 444683  Duplicates: 0  Warnings: 0

OK. That's done. Now to get some monthly stats.

SELECT
  LEFT(rev_timestamp, 7) AS month,
  SUBSTRING_INDEX(link, "#", 1) AS link,
  COUNT(DISTINCT rev_id) AS postings
FROM staging.th_link_additions
GROUP BY 1,2;

Teahouse link postings. The raw count of edits that add links to the teahouse is plotted for the 4 most common teahouse link patterns.

--Halfak (WMF) (talk) 22:03, 24 November 2015 (UTC)[reply]

Here's the raw count of all types of teahouse postings:

Teahouse postings (overall). The raw count of User_talk edits that add a link to the teahouse is plotted by month.

--Halfak (WMF) (talk) 00:16, 25 November 2015 (UTC)[reply]