Research talk:Newly registered user

From Meta, a Wikimedia project coordination wiki

Work Log

Archive

Discussion

User names Versus Count

The output of the metric should be the number of registered users in a given project for the selected time range, rather than a list of user names. --NRuiz (WMF) (talk) 15:33, 24 April 2014 (UTC)


Bots excluded?

Should bots be excluded from this measure? --Halfak (WMF) (talk) 17:26, 4 December 2013 (UTC)

I would say yes, definitely they should.--GlimmerPhoenix (talk) 18:48, 14 February 2014 (UTC)
Halfak (WMF), GlimmerPhoenix: I'd like to push back on this suggestion for the following reasons:
  1. New bot registrations should account for a negligible fraction of new user registrations in any given time period.
  2. Bot status is a property of a user that can change retrospectively. Until we have a different and dedicated process for bot registrations (there have been discussions on whether mandatory API keys are needed, for example), we would have to constantly update historical data to account for regular accounts switching to the bot group. I don't think this is worth the effort.
That said, bot identification is going to be a very high priority item for the other categories of metrics; I just don't think it's critical for this specific metric (at least until we start seeing bulk bot registrations). DarTar (talk) 19:47, 18 March 2014 (UTC)

log_action='newusers'

At least in the German Wikipedia, the logging table has a 4th type of entry for log_type='newusers', in which the log_action column holds the same value ('newusers'). A count on an old dump returns > 85K entries with this combination. Do we know which use case triggers this 4th type of entry? Thanks. --GlimmerPhoenix (talk) 18:48, 14 February 2014 (UTC)

GlimmerPhoenix: Aaron and I looked into this yesterday (sorry I missed your comments on the talk page). It appears that this was the log_action associated with regular account creations for a short time window (September 2005 - April 2006), until the new log_action (create) was introduced. We started a page to document known anomalies in historical data stored in MediaWiki's database. Contributions are very welcome. --DarTar (talk) 19:51, 18 March 2014 (UTC)
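One way to handle the legacy value, sketched under assumptions: the snippet treats log_action='newusers' as equivalent to 'create' when deciding whether a log entry is a regular self-registration. That equivalence is inferred from the reply above, and the helper name `is_self_created` and the sample rows are hypothetical.

```python
# Actions treated as ordinary self-registrations in this sketch.
# "newusers" is the legacy value described in the thread above; counting it
# alongside "create" is an assumption, not a confirmed rule.
SELF_CREATED_ACTIONS = {"create", "newusers"}


def is_self_created(log_type, log_action):
    """Return True if a logging row looks like a regular account creation."""
    return log_type == "newusers" and log_action in SELF_CREATED_ACTIONS


# Hypothetical (log_type, log_action) pairs for illustration only.
entries = [
    ("newusers", "newusers"),  # legacy entry
    ("newusers", "create"),    # regular registration
    ("newusers", "create2"),   # not counted here
    ("block", "block"),        # unrelated log type
]

self_created = [e for e in entries if is_self_created(*e)]
print(self_created)
```

Restricting the legacy value to the September 2005 - April 2006 window would be a natural refinement, but the exact boundary dates are not given here, so the sketch leaves that out.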

Sensitivity analysis

The current proposal makes a number of assumptions but does not yet present a sensitivity analysis. For example, we could analyze the impact of including or excluding attached users, bot registrations, and proxy registrations in the definition. DarTar (talk) 23:09, 18 March 2014 (UTC)
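A minimal sketch of what such an analysis could compute: the count of new registrations under every combination of exclusions. The flag names (`attached`, `bot`, `proxy`), the sample records, and both function names are hypothetical; real flags would have to be derived from the wiki's databases.

```python
from itertools import product

# Hypothetical per-registration flags, for illustration only.
registrations = [
    {"user": "A", "attached": False, "bot": False, "proxy": False},
    {"user": "B", "attached": True, "bot": False, "proxy": False},
    {"user": "C", "attached": False, "bot": True, "proxy": False},
    {"user": "D", "attached": False, "bot": False, "proxy": True},
    {"user": "E", "attached": False, "bot": False, "proxy": False},
]
FLAGS = ("attached", "bot", "proxy")


def count_with_exclusions(regs, excluded):
    """Count registrations that carry none of the excluded flags."""
    return sum(1 for r in regs if not any(r[f] for f in excluded))


def sensitivity_table(regs):
    """Map each tuple of excluded flags to the resulting count."""
    table = {}
    for mask in product((False, True), repeat=len(FLAGS)):
        excluded = tuple(f for f, on in zip(FLAGS, mask) if on)
        table[excluded] = count_with_exclusions(regs, excluded)
    return table


table = sensitivity_table(registrations)
for excluded, count in sorted(table.items(), key=lambda kv: len(kv[0])):
    print(excluded or "(no exclusions)", count)
```

Comparing the entries against the no-exclusion baseline shows how much each assumption moves the metric.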

Qs from the Analytics Developers

The output of the SQL includes usernames; however, we are interested in just a daily count. Is there a particular reason the code doesn't return counts? This applies to the other metrics as well. KLeduc (WMF) (talk) 21:30, 24 April 2014 (UTC)

Mostly because I didn't know that you guys wanted counts. Can you provide me with a spec of what you expect from each of the metric SQL statements so that I can fix the SQL once and be done? --Halfak (WMF) (talk) 20:19, 25 April 2014 (UTC)
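For illustration, a count-returning variant might look like the sketch below. It runs against an in-memory SQLite table that mimics only the logging columns used here, not against a production database. The sample rows and the function name `daily_new_user_counts` are hypothetical, and the filter log_action = 'create' is taken from the discussion on this page rather than from the metric's actual SQL.

```python
import sqlite3

# Toy stand-in for a wiki's logging table; only the columns used below.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_type TEXT, "
    "log_action TEXT, log_timestamp TEXT, log_user_text TEXT)"
)
# Hypothetical rows; timestamps use the YYYYMMDDHHMMSS string layout.
rows = [
    (1, "newusers", "create", "20140424010203", "Alice"),
    (2, "newusers", "create", "20140424150000", "Bob"),
    (3, "newusers", "create2", "20140424160000", "Carol"),
    (4, "newusers", "create", "20140425090000", "Dave"),
    (5, "delete", "delete", "20140425100000", "Erin"),
]
conn.executemany("INSERT INTO logging VALUES (?, ?, ?, ?, ?)", rows)


def daily_new_user_counts(conn, start, end):
    """Return (day, count) pairs for self-created accounts in [start, end)."""
    query = """
        SELECT substr(log_timestamp, 1, 8) AS day, COUNT(*) AS new_users
        FROM logging
        WHERE log_type = 'newusers'
          AND log_action = 'create'
          AND log_timestamp >= ? AND log_timestamp < ?
        GROUP BY day
        ORDER BY day
    """
    return conn.execute(query, (start, end)).fetchall()


counts = daily_new_user_counts(conn, "20140424000000", "20140426000000")
print(counts)
```

Grouping on the first eight characters of the timestamp gives one row per UTC day, so the output contains counts only and no usernames.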

Difference between sample queries for wiki's logging table and Eventlogging

Running COUNT(*) over the rows returned by the two sample queries for “local”, I get a difference of ~0.7% between them (checked with enwiki, dewiki, elwiki). Since enwiki alone has a count of ~150K, it looks like the queries are measuring different things. What is causing this difference? --QChris (talk) 12:38, 30 April 2014 (UTC)

As far as I can tell (see my work), these are records that were dropped from EventLogging but appear in the production database. --Halfak (WMF) (talk) 19:44, 19 May 2014 (UTC)
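A rough sketch of how such a discrepancy could be broken down, assuming both sources can be reduced to a comparable key such as a user id. That key, the function name `compare_sources`, and the sample ids are all hypothetical.

```python
def compare_sources(logging_ids, eventlogging_ids):
    """Return ids unique to each source and the share missing from EventLogging."""
    logging_ids = set(logging_ids)
    eventlogging_ids = set(eventlogging_ids)
    only_logging = logging_ids - eventlogging_ids
    only_eventlogging = eventlogging_ids - logging_ids
    # Fraction of logging-table records with no EventLogging counterpart.
    missing_share = len(only_logging) / len(logging_ids) if logging_ids else 0.0
    return only_logging, only_eventlogging, missing_share


# Hypothetical ids: 1000 registrations in the logging table, 7 of which
# never made it into EventLogging.
logging_ids = range(1, 1001)
dropped = {3, 140, 141, 500, 777, 901, 999}
eventlogging_ids = [i for i in logging_ids if i not in dropped]

only_logging, only_eventlogging, missing_share = compare_sources(
    logging_ids, eventlogging_ids
)
print(sorted(only_logging), sorted(only_eventlogging), missing_share)
```

If the difference is entirely one-sided, as in this toy data, that is consistent with records being dropped by one pipeline rather than with the two queries using different definitions.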

Selection criterion for self-created accounts

Some users have a log entry matching log_type = 'newusers' AND log_action = 'create' for the same username on more than one project (e.g. enwiki and dewiki, both in 2014). Is checking log_type and log_action selective enough? (Or how could users create such log entries in two different projects?) --QChris (talk) 12:55, 30 April 2014 (UTC)

Hey QChris, I'm not sure what you are referring to here, but the logging table is specific to each project's database. In other words, it's not a concern that users may register with the same name on different wikis. If you could show an example of such an entry after the deployment of central auth, we should take a look at it with csteipp. --Halfak (WMF) (talk) 19:22, 19 May 2014 (UTC)
Hi Halfak, sorry for the vagueness. Let's have an example:
  select * from enwiki.logging WHERE log_id = 54002013;
  select * from dewiki.logging WHERE log_id = 58580076;
The first one is from enwiki, the second from dewiki. (Username etc is probably not secret, but I prefer to not paste concrete data in here.) --QChris (talk) 12:26, 21 May 2014 (UTC)
QChris, I think I see now. Since this measure is generated on a per-wiki basis, it shouldn't be a problem. Looking through centralauth, this user did go through the regular account registration process on both dewiki and enwiki, but later associated the local accounts with their global account via password. This is an unusual case, so I don't expect that it will have substantial implications for this method of filtering R:Attached users. In fact, this user is "attached" the expected way on 42 different wikis (log_action = "create2"). However, in the future, we may choose to reference the centralauth database to look for evidence of post-registration attachment via password. --Halfak (WMF) (talk) 16:23, 23 May 2014 (UTC)
@CSteipp: see above. Any clue why this happened? --DarTar (talk) 19:42, 23 May 2014 (UTC)

Binning of new users

It should be possible to bin the new users somehow, as there is a considerable difference between a user who only creates an account and a user who makes one or more edits in the main space. I'm not even sure there is only one such binning scheme: alternate schemes could count only users making good-faith edits, or only users making non-destructive edits. That would then lead to five different binning schemes.

An alternative would be to use filtering, where good faith editors and destructive editors are filter options. --Jeblad 11:24, 31 October 2018 (UTC)
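As a sketch of the simplest scheme mentioned above, the snippet below splits new users into those who only registered and those with at least one main-space edit. The bin labels, the function name `activity_bin`, and the edit counts are hypothetical. Good-faith or non-destructive bins would need labels that this sketch does not attempt to derive.

```python
from collections import Counter


def activity_bin(main_space_edits):
    """Assign a new user to a bin based on their main-space edit count."""
    return "edited main space" if main_space_edits >= 1 else "account only"


# Hypothetical main-space edit counts per newly registered user.
edits_by_user = {"A": 0, "B": 3, "C": 1, "D": 0, "E": 0}

bins = Counter(activity_bin(n) for n in edits_by_user.values())
print(dict(bins))
```

The filtering alternative would keep a single count and apply a predicate like `activity_bin` as an optional filter instead of reporting every bin.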

Percent change between months

The percent change between months does not make much sense. In any given month the random variation is quite large, so the difference between two months will be close to a completely random number. A better measure would be to count over an integer number of weeks and compare two such intervals. A single month does not cover an integer number of weeks and should not be used, but three months are very close to 13 weeks and could be used. Over three months there will be seasonal variations, however, and that will compound the problem. The only really safe period is a year. It is possible to compare the same weekday in consecutive weeks, like Mondays in week 42 and week 43, but not a Monday in week 42 and a Friday in week 43. People do different things on Mondays and Fridays, and that will introduce an error.

Slow variations can be visualized by using a sliding window to calculate the mean over a small number of weeks. The same window can also be used to calculate the standard deviation. A graph can be made over a year, from January to December, with one line for each year. Trends will then show up as periodic changes, stacked on top of each other, with an even and slowly changing trend between them. If the lines are colored in sequence, it is easy to see how they change.

I would propose additional graphs being made. One type has periodic graphs, over a period of one year and over a period of one week. Another type shows a sliding window mean overlaid on a bar graph, with addition of a one year difference. --Jeblad 12:54, 31 October 2018 (UTC)
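The two calculations suggested above can be sketched as follows: a percent change between two consecutive windows that each span a whole number of weeks, and a sliding-window mean with standard deviation over weekly totals. The function names and the sample series are hypothetical, and the sample standard deviation is one possible choice of spread measure.

```python
from statistics import mean, stdev


def percent_change_whole_weeks(daily_counts, weeks):
    """Percent change between the last two consecutive windows of `weeks` weeks."""
    span = 7 * weeks
    if len(daily_counts) < 2 * span:
        raise ValueError("need at least two full windows of daily counts")
    previous = sum(daily_counts[-2 * span:-span])
    current = sum(daily_counts[-span:])
    if previous == 0:
        raise ValueError("previous window is empty; percent change undefined")
    return 100.0 * (current - previous) / previous


def rolling_stats(weekly_counts, window):
    """Return (mean, sample standard deviation) for each full sliding window."""
    if window < 2:
        raise ValueError("window must cover at least two weeks")
    stats = []
    for end in range(window, len(weekly_counts) + 1):
        chunk = weekly_counts[end - window:end]
        stats.append((mean(chunk), stdev(chunk)))
    return stats


# Hypothetical daily registrations: two weeks at 10 per day, then two at 11.
daily = [10] * 14 + [11] * 14
change = percent_change_whole_weeks(daily, weeks=2)

# Hypothetical weekly totals.
weekly = [10, 12, 14, 16]
stats = rolling_stats(weekly, window=3)
print(change, stats)
```

Because each window covers whole weeks, every weekday appears equally often in both windows, which avoids the Monday versus Friday mismatch described above. Seasonal effects over longer spans are not addressed by this sketch.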