User:Gmaxwell/WMF2007 vote distribution

Questions have been raised over what impact the mailings had on the WMF 2007 board election.

While the actual impact of the mailings is unknowable, we can still talk intelligently about how proportionally votes were distributed among the projects. If the mass mailings could only have affected the election by changing the turnout of the projects, then an analysis of project turnout should tell us something about the likely impact of the mails.

In order to talk about how different the actual proportions were from the ideal we must first define an ideal. Several reasonable alternatives are possible and will be described here. For each ideal a rationale will be provided, and our election will be compared to it below.

Even after establishing an ideal, we must decide what the study is meant to be used for.

If our purpose is only to compare the 'fairness' of participation, then it probably makes sense to treat all projects equally, so that a 10% error in the expected participation on ruwikisource is just as bad as a 10% error in the participation rate of dewiki.

If, instead, our purpose is to study what effect the level of 'fairness' in the election has on the election outcome as a whole, we must also consider project size: a small error for a large project may have much greater impact than a large error for a small project.
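The two purposes above can be sketched as two ways of scoring the same data. This is only one possible reading: `relative_errors` treats every project equally by dividing each error by the project's ideal share, while `absolute_errors` keeps the raw difference in vote share, so large projects dominate. The function names and the shares below are hypothetical and are not taken from the election data.

```python
def relative_errors(ideal, actual):
    """Error of each project as a fraction of its own ideal share."""
    return {
        project: abs(actual.get(project, 0.0) - share) / share
        for project, share in ideal.items()
        if share > 0
    }


def absolute_errors(ideal, actual):
    """Error of each project as a raw difference in share of all votes."""
    return {
        project: abs(actual.get(project, 0.0) - share)
        for project, share in ideal.items()
    }


# Hypothetical shares (fractions of all votes), for illustration only.
ideal = {"dewiki": 0.10, "ruwikisource": 0.001}
actual = {"dewiki": 0.11, "ruwikisource": 0.0011}

rel = relative_errors(ideal, actual)
abs_err = absolute_errors(ideal, actual)
```

In this made-up case both projects are off by about 10% of their own ideal share, so the relative view scores them the same, while the absolute view shows the dewiki error is about a hundred times larger in terms of total votes.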

Ideal: Active and eligible

Distance over time from "active and eligible" ideal. 0 on the graph means the election percentages perfectly match this ideal; higher numbers indicate increasing inequality.

Eligibility-based metrics are easily justifiable: If the eligibility criteria were biased then the entire election is fundamentally flawed. If we are to accept that the election had any chance of being trustworthy and democratic to begin with we must assume that the eligibility criteria were not biased.

An additional reasonable constraint on eligibility is account activity. Accounts which have been inactive for a long time are unlikely to vote. It has been suggested (link to jwales post goes here) that in the future activity be included as an eligibility criterion.

If all projects had equal turnout of active and eligible users, with activity defined as at least one edit since March 1st, the distribution of voters would be:

enwiki: 48.66%
dewiki: 11.14%
frwiki: 5.63%
jawiki: 4.73%
eswiki: 2.97%
itwiki: 2.82%
plwiki: 2.49%
nlwiki: 2.19%
ruwiki: 1.74%
zhwiki: 1.64%
ptwiki: 1.30%
svwiki: 1.25%
fiwiki: 1.09%
hewiki: 1.06%
(286 other projects with active and eligible users are not listed for space reasons, but they are included in the calculations)
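Percentages like those listed above follow from the equal-turnout assumption: if every project turns out the same fraction of its active and eligible users, each project's share of voters equals its share of all such users. A minimal sketch, assuming we have a per-project count of active and eligible users; the function name and the counts are hypothetical, not the real figures.

```python
def expected_shares(active_eligible_counts):
    """Share of voters per project if all projects had equal turnout."""
    total = sum(active_eligible_counts.values())
    if total == 0:
        raise ValueError("no active and eligible users in any project")
    return {
        project: count / total
        for project, count in active_eligible_counts.items()
    }


# Hypothetical counts, for illustration only.
counts = {"enwiki": 5000, "dewiki": 1000, "frwiki": 500}
shares = expected_shares(counts)

for project, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{project}: {share:.2%}")
```

The turnout rate itself cancels out, which is why only the counts are needed.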

Ideal: Pure eligibility

Ideal: Pure active named users

Distance over time from "5 edits in the last one/two months" ideal. 0 on the graph means the election percentages perfectly match this ideal; higher numbers indicate increasing inequality.

The number of active users is another obvious criterion for measuring the proportionality of the election. In this graph we define activity as having at least 5 edits over either one or two months.

The 5-edit criterion has been used by some other studies of active Wikipedians.

Based on the five-edits-in-two-months criterion, we would expect the voter distribution of the top projects to be:

enwiki: 51.72%
dewiki: 8.01%
frwiki: 4.12%
commonswiki: 3.86%
jawiki: 3.71%
eswiki: 3.65%
itwiki: 2.50%
plwiki: 1.95%
nlwiki: 1.67%
ptwiki: 1.62%
ruwiki: 1.53%
zhwiki: 1.46%

Based on the five-edits-in-one-month criterion, we would expect the voter distribution of the top projects to be:

enwiki: 49.63%
dewiki: 8.42%
commonswiki: 4.19%
frwiki: 4.18%
jawiki: 4.07%
eswiki: 3.64%
itwiki: 2.68%
plwiki: 2.02%
nlwiki: 1.70%
ptwiki: 1.63%
zhwiki: 1.61%
ruwiki: 1.60%

Ideal: Main namespace page views

Ideal: Size of project content

Ideal: Referrers to donation page

Okay... maybe not? It would be fun if we had the data.

Math and measurements

If we have n projects, both our 'ideal' distribution of voters and our actual distribution of voters at any given time can be imagined as single points in n-dimensional space. By measuring the distance between the point defined by our actual voter distribution and the point defined by the 'ideal' distribution, we can convert our error into a single value which treats all error fairly.

This error metric is used in many optimization problems.
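The distance measurement can be sketched as follows, assuming the distance meant is the ordinary Euclidean (straight-line) distance; the text does not name the metric, so that choice is an assumption. The function name and the two-project distributions are hypothetical.

```python
import math


def distribution_distance(ideal, actual):
    """Euclidean distance between two voter distributions.

    Each distribution maps a project name to its share of voters.
    A project missing from one distribution counts as a share of 0.
    """
    projects = set(ideal) | set(actual)
    squared = sum(
        (actual.get(p, 0.0) - ideal.get(p, 0.0)) ** 2 for p in projects
    )
    return math.sqrt(squared)


# Hypothetical two-project distributions, for illustration only.
ideal = {"a": 0.5, "b": 0.5}
actual = {"a": 0.6, "b": 0.4}

same = distribution_distance(ideal, ideal)
off = distribution_distance(ideal, actual)
```

A result of 0 means the actual distribution matches the ideal exactly, and larger values mean the two points are further apart. Computing this value for the votes cast up to each point in time would give a "distance over time" curve.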