Research talk:Teahouse long term new editor retention/Work log/2015-11-24
Tuesday, November 24, 2015
Today, I'm working with some datasets that show how links to the teahouse and subst templates were added to talk pages historically.
I've finished running the teahouse link extractor, but I found a couple of issues.
- We're matching "wp:tea" which does not got to the teahouse, but rather en:Wikipedia:A nice cup of tea and a sit down. This is easily fixed with grep.
- We're picking up the section titles (e.g. "Foo#Section bar") in the links. This is OK, but it means that we're not counting the link additions in the expected way. Once we get these records loaded into the database, I can use
STRING_INDEX()to split off the "#" section info.
> CREATE TABLE th_link_additions_clean SELECT page_id, page_namespace, page_title, rev_id, rev_timestamp, rev_comment, SUBSTRING_INDEX(link, "#", 1) AS link, SUM(diff) AS diff FROM th_link_additions GROUP BY 4,7; Query OK, 444683 rows affected (10.78 sec) Records: 444683 Duplicates: 0 Warnings: 0 > alter table th_link_additions_clean rename to th_link_additions; Query OK, 0 rows affected (0.01 sec) > CREATE UNIQUE INDEX rev_link ON th_link_additions (rev_id, link); Query OK, 444683 rows affected (2.54 sec) Records: 444683 Duplicates: 0 Warnings: 0
OK. That's done. Now to get some monthly stats.
SELECT LEFT(rev_timestamp, 7) AS month, SUBSTRING_INDEX(link, "#", 1) AS link, COUNT(DISTINCT rev_id) AS postings FROM staging.th_link_additions GROUP BY 1,2;
Here's the raw count of all types of teahouse postings: