Research talk:Teahouse long term new editor retention/Work log/2015-10-15
Thursday, October 15, 2015
Today I want to find out how many people in the sample visited the Teahouse, as well as when they visited, and what Teahouse page they visited first. I'll add this information to the th_retention_sample table.
update th_retention_sample trs2, (select rev_user, min(rev_id) as first_rev, rev_timestamp as first_rev_timestamp, DATEDIFF(DATE(str_to_date(rev_timestamp, '%Y%m%d%H%i%s')), DATE(sample_date)) as days_to_first_rev, page_id as first_rev_page from ( select rev_id, rev_timestamp, rev_user, page_id, sample_date from th_retention_sample trs join enwiki.revision r on trs.user_id = r.rev_user join enwiki.page p on r.rev_page = p.page_id where p.page_namespace in (4,5) and p.page_title like "Teahouse%" and r.rev_timestamp > 20141017000000 order by rev_user_text, rev_id desc) tmp group by rev_user order by rev_user) tmp2 set trs2.first_teahouse_rev_id = tmp2.first_rev, trs2.ftr_timestamp = tmp2.first_rev_timestamp, trs2.ftr_page_id = tmp2.first_rev_page, trs2.days_to_ftr = tmp2.days_to_first_rev where trs2.user_id = tmp2.rev_user; #232 rows updated
How many of these people were invited to visit? I'll exclude 3 outliers here for whom the value of days_to_ftr was less than 0. I did some spot checks on those folks, and it looks like they may have been users who were returning under a new account. This editor, for example, who was in the control group and didn't receive an invite, explains when he created his talkpage that he had previously edited Wikipedia, so he may have learned about the Teahouse before.
select count(user_id) from th_retention_sample where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr >= 0; #223 people
That's a surprisingly small percentage of the total invitees. 223/11674 = 0.019, so about 2% of invitees. About half of the number I found in my previous research. However, the edit threshold for invitation was only 5 edits this time, rather than 10, so we should expect a larger attrition rate.
What is the average number of days between invitation and first visit?
select avg(days_to_ftr), count(user_id) from th_retention_sample where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr >= 0; #40.8161
40 days is a lot, there must be some highly-active outliers who are dragging up the average.
How about the median? The median is 6 days. (I cheated here because calculating median is difficult in MySQL, so I used this online calculator.) That seems more or less on point. Let's get a sense of the distribution.
What's the modal number of days until first visit?
select days_to_ftr as mode from th_retention_sample where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr >= 0 group by 1 order by count(1) desc limit 1; #1 day
Speaking of those outliers, how many people visited more than a week after invitation?
select count(user_id) from th_retention_sample where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr > 7; #105 visitors
And a month after invitation?
select count(user_id) from th_retention_sample where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr > 30 #70 visitors
This shocked me. I expected people to visit the Teahouse early, or not at all. But 47% waited a week before visiting, and 31% waited a month or more. I never would have expected this... I suspect that the persistence of the talkpage invite serves as a reminder that help is there when people need it. I can't wait to explore this in greater detail.
Where people went first, when they visited
Now I'm looking to see what Teahouse page visitors edited first on their first visit. There are two main calls to action on the Teahouse landing page: ask a question, or create a profile. Which of these do people choose (first)?
select count(user_id) as num_visitors, page_namespace, page_title from th_retention_sample trs join enwiki.page p on trs.ftr_page_id = p.page_id where sample_group = "invited" and first_teahouse_rev_id IS NOT NULL AND days_to_ftr >= 0 group by page_namespace, page_title order by num_visitors desc;
Here's the output:
num ns page 155 4 Teahouse/Questions 29 4 Teahouse/Guests/Right_column 25 4 Teahouse/Guests/Left_column 8 4 Teahouse/Host_landing 2 5 Teahouse 1 4 Teahouse/Questions/Archive_280 1 5 Teahouse/Questions/Archive_Index 1 4 Teahouse 1 4 Teahouse/Guest_book
Looks like about 70% visited the question page first (presumably to ask a question), 24% created a profile first. I haven't looked at the relative proportion of questions vs profiles from invitees in the Pilot reports, but that distribution seems about right. A few people edited the talkapage, or signed up as hosts as a first action. The archive and guest book edits may be vandalism.