Grants:IEG/Wikiscan multi-wiki

status: selected
Wikiscan multi-wiki
summary: Provide daily-updated statistics for the biggest Wikimedia wikis
target: Wikimedia wikis with more than 100,000 edits (currently 336 wikis)
strategic priority: increase participation
amount: 6690 EUR (7632 USD)
grantee: Akeron
contact: akeron.wp@gmail.com
created on: 18:14, 29 February 2016 (UTC)
round 1 2016


This is a translation of the French version; all corrections are welcome.

Project idea

What is the problem you're trying to solve?

Provide daily-updated statistics for Wikimedia wikis: show the current and past activity of each wiki, its main contributors according to several metrics, and detailed statistics for each user, and offer a tool to assess the performance of projects that support the participation of new editors.

What is your solution?

Transform and improve the website http://wikiscan.org, which has computed many statistics for the French Wikipedia since 2011, into a multi-wiki, multi-language site.

The site calculates statistics by date and by user:

  • Display the most active pages and users on the wiki over the past 24 hours.
  • Display the most active pages and users for each day and each month since the wiki was created.
  • Show the users who have contributed the most since the wiki was created, with many statistics such as total edits, number of articles created, number of days and months with at least one action, total number of administrator actions, number of deletions, etc.
  • View detailed statistics for a user account in seconds, even if the account totals several million edits.
  • Provide metrics for projects that support the participation of new volunteers.

Introducing Wikiscan

The purpose of Wikiscan is to provide many useful statistics about Wikimedia projects. Several statistics are already available, but they are often scattered across multiple sites and wiki pages; each source is updated at its own pace, and they do not all draw on the same data. Some use dumps, with delays of one to several months between updates (e.g. WikiStats); others use MediaWiki internal data, which are approximate (e.g. List of Wikipedians by number of edits). Wikiscan centralizes many statistics in one place, computed from the same underlying data and updated at the same rate.

Wikiscan computes all statistics from daily data. This allows regular updates by recalculating only the current day and then recombining the data for the month, the year, the total, and each user. A complete recalculation is also run periodically to account for deleted and restored pages or to add new statistics. This flexible system allows pre-calculated statistics to be displayed quickly and updated regularly with minimal computation.
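
The idea of recalculating only the current day and then recombining can be illustrated with a minimal Python sketch. This is not Wikiscan's actual code (the site is written in PHP); the function names, the data layout and the sample figures are all assumptions made up for the illustration.

```python
from collections import Counter, defaultdict

def refresh_day(daily_stats, day, new_counts):
    """Replace a single day's counters, leaving every other day untouched."""
    updated = dict(daily_stats)
    updated[day] = new_counts
    return updated

def combine_by_month(daily_stats):
    """Sum per-day counters into per-month counters.

    daily_stats maps an ISO date ("2016-03-02") to a dict of counts.
    """
    monthly = defaultdict(Counter)
    for day, counts in daily_stats.items():
        monthly[day[:7]].update(counts)  # "2016-03-02" -> "2016-03"
    return {month: dict(total) for month, total in monthly.items()}

# Made-up sample data: three stored days, then the latest day is recomputed.
daily = {
    "2016-03-01": {"edits": 120, "reverts": 4},
    "2016-03-02": {"edits": 95, "reverts": 2},
    "2016-04-01": {"edits": 80, "reverts": 1},
}
daily = refresh_day(daily, "2016-04-01", {"edits": 101, "reverts": 3})
monthly = combine_by_month(daily)
print(monthly)
```

Only the refreshed day is recomputed from raw data; the monthly figures are cheap sums over stored daily counters, and the same pattern would extend to yearly, total and per-user figures.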

Statistics by date

The statistics listed below are available for the last 24 hours and for each day and each month since the creation of the wiki:

  • Page statistics, which show the most active pages, the number of edits, the number of users who edited the page, the number of reverts, the net change in size over the period (diff) and the sum of absolute diffs. The table can be sorted, and several namespace-based filters are available. Examples: April 2016, March 2, 2016, pages January 2010.
  • User statistics: number of edits, logged actions (page moves, blocks, etc.), reverts, average contribution rate, total diff, total absolute diff and presence time. The table can be filtered to display only users, IPs, bots or admins and, in some cases, sorted by column. Example: March 2016.
  • General statistics over the period, such as the total number of edits overall and for each namespace, the number of users and IPs, the number of page and article creations, accumulated presence time, the total for each logged action, etc. Example: March 2016.
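
To make the difference between the total diff and the sum of absolute diffs concrete, here is a small Python sketch. It is only an illustration: the Revision type, the page_stats function and the sample revisions are hypothetical, and the real computation may differ.

```python
from dataclasses import dataclass

@dataclass
class Revision:
    page: str
    user: str
    old_len: int  # page size in bytes before the edit
    new_len: int  # page size in bytes after the edit

def page_stats(revisions):
    """Aggregate edits, distinct users, net diff and absolute diff per page."""
    stats = {}
    for rev in revisions:
        entry = stats.setdefault(
            rev.page, {"edits": 0, "users": set(), "diff": 0, "abs_diff": 0}
        )
        delta = rev.new_len - rev.old_len
        entry["edits"] += 1
        entry["users"].add(rev.user)
        entry["diff"] += delta          # additions and removals cancel out
        entry["abs_diff"] += abs(delta)  # total volume of change
    return stats

# Made-up edit war: the same 1000 bytes are added and removed twice.
revisions = [
    Revision("Example", "Alice", 5000, 6000),
    Revision("Example", "Bob", 6000, 5000),
    Revision("Example", "Alice", 5000, 6000),
    Revision("Example", "Bob", 6000, 5000),
]
stats = page_stats(revisions)
print(stats["Example"])
```

In this example the net diff is 0 while the absolute diff is 4000, which is why the absolute value is a useful signal for spotting pages where content is repeatedly added and removed.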

The statistics for the last 24 hours [1] are updated several times an hour. They show where current activity is highest, which pages are the most active, where the major edit wars are, etc. The default sort combines several values to give a better indicator. The values that weigh most are the number of users and edits, but the diff and the total volume in absolute value also count, so an edit war that adds and removes a lot of content is highlighted more than a small one.

Another page, called grid, uses the same data but shows several blocks with different predefined filters: a namespace, a page name or a particular sort order. This makes it possible to see, in separate blocks, the most active pages in various areas: articles, new articles, discussions, meta, articles for deletion or the currently most viewed articles.

Page view statistics make it possible to take into account the volume of visitors on the most active pages and to better anticipate trends in the news. This feature is currently disabled following a change of servers and is awaiting an overhaul. An example from January 2015, when the feature was enabled, can be seen on archive.org (page views are in the first column).

Planned improvements:

  • Reintegrate page views, with an overhaul of the old system, which would not scale to multiple wikis. Use the new statistics that exclude bots.
  • Improve the display of the most active articles over the last 24 hours, for example by showing the top ten in more detail with an individual graph of recent activity for each page.

User stats
Users' edits | Users tested | Wikiscan average | Xtools-ec average
500+         | 10           | 0.5 s            | 5.2 s
5,000+       | 10           | 0.6 s            | 8.2 s
10,000+      | 10           | 0.6 s            | 9.7 s
50,000+      | 10           | 0.7 s            | 38 s
100,000+     | 10           | 0.6 s            | 51 s
250,000+     | 10           | 0.7 s            | 93 s
500,000+     | 8            | 0.8 s            | 220 s
1,000,000+   | 3            | 0.6 s            | 256 s
1,500,000+   | 3            | 0.8 s            | 261 s
2,000,000+   | 1            | 0.7 s            | 467 s

A quick comparison of response times between Wikiscan and Xtools-ec on fr.wikipedia.org. Tested April 30, 2016.

Wikiscan is commonly used as an edit counter, which is the feature visitors use the most. Compared to other edit counters, it displays the page quickly even if the user totals several million edits, whereas other tools may take several minutes to load, or even fail. About 30 statistics are available on edits and logged actions. The goal is to quickly give a picture of a user's contributions and their history, and to provide other useful indicators besides the number of edits, which can vary widely, for example when a user makes many small changes with semi-automated tools.

Examples: user with 800 k edits, bot with about 2 million edits, bot with 2.5 million edits.

Some examples of additional statistics:

  • Estimated total time (presence time): the estimated time spent making edits and logged actions on the wiki. For example, a user making one edit every 5 minutes for 30 minutes will have the same presence time as a user making one edit every minute over the same period, although the first user's edit count is 5 times smaller. This indicator could be more reliable than the raw number of edits for measuring a user's involvement. It has been used in particular in a professional thesis on the value of volunteering on Wikipedia (see [2]).
  • Participated days/months: the number of distinct days and months with at least one action. This shows which users contribute over time and return regularly. These indicators are more reliable than seniority calculated from the account creation date, because some accounts remain inactive for long periods. These statistics are also used to calculate averages of other values, such as the average time and the number of edits per participated day.
  • Re-editing: the number of edits made on the same page within a short time. Some users rely heavily on the preview feature and submit all their changes at once, while others edit the same page many times in a short period, which tends to inflate the number of edits.
  • Chain edits: an attempt to detect and count rapid edits across different pages, for users running scripts with their account.
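
The presence time and participated days indicators can be sketched in Python as follows. The proposal does not say how presence time is estimated, so the rule used here (adding up the gaps between consecutive actions and ignoring any gap longer than a cut-off) is an assumption, as are the 15-minute cut-off and the function names.

```python
from datetime import datetime, timedelta

def presence_time(timestamps, max_gap=timedelta(minutes=15)):
    """Add up gaps between consecutive actions, skipping gaps above max_gap.

    The max_gap cut-off is a hypothetical value chosen for this sketch.
    """
    ordered = sorted(timestamps)
    total = timedelta()
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= max_gap:
            total += gap
    return total

def participated_days(timestamps):
    """Count the distinct calendar days with at least one action."""
    return len({ts.date() for ts in timestamps})

start = datetime(2016, 3, 1, 20, 0)
# One action every 5 minutes from 20:00 to 20:30 (7 actions).
slow = [start + timedelta(minutes=5 * i) for i in range(7)]
# One action every minute from 20:00 to 20:30 (31 actions).
fast = [start + timedelta(minutes=i) for i in range(31)]

print(len(slow), presence_time(slow))
print(len(fast), presence_time(fast))
```

Under this assumed rule both users get the same presence time for the half hour, even though their edit counts differ widely, which matches the example given in the list above.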

About thirty statistics are calculated for each user. They are available for the total and for each year and each month since the account was created, which shows how each value evolves. Percentages can also be displayed, for example to see trends by month or year in the share of edits made in articles, discussions, meta, etc.

Overall user statistics

Since all user statistics are pre-calculated, they can be displayed in a summary table [3]. This makes it possible to quickly see who has contributed the most to the wiki. The view can be sorted by most of the thirty available statistics, showing for example who spent the most time (column "Durée", i.e. "duration"), who was active on the most distinct days (column "Jours", i.e. "days"), or who made the most article creations (column "Articles"), blocks (column "Bloc."), protections (column "Protect."), abuse filter edits (column "Filtres"), etc.

Planned improvements:

  • Display the same table for each year and each month since the wiki's creation.
  • Add graphs showing the evolution of active users based on the number of months of participation.

Category or user list

The same user table can be restricted to the users in a MediaWiki category or in a supplied list. This is used by projects that support the participation of new editors to measure their results. For instance, the Afripédia project, supported by Wikimedia France, encourages participation in Francophone Africa. If participants add a user box that places them in a category, the statistics for all participants can be displayed on one page.

Examples: Afripédia project participants, Training Afripédia in Madagascar.

The data can then be sorted by any of the available statistics. For example, we can see that 51 users of the Afripédia project contributed for 2 hours or more in total presence time [4], that 47 users contributed on at least 5 different days [5] and 37 users in at least 3 different months [6], that 65 users added more than 1 kB overall to articles [7], etc.

Planned improvements:

  • Allow specifying a wiki page that contains a list of users instead of a category, with options to select how users are extracted.

Other improvements

The most important change will be the transition to a multi-wiki site, which must be optimized to continuously update the statistics of hundreds of wikis instead of one. A system of "workers" will allow several wikis to be updated in parallel, according to their size, without overloading the server. A maintenance page will be developed to track the update status of all wikis.
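
One possible reading of the worker system is a bounded pool that processes eligible wikis in parallel, largest first. The Python sketch below only illustrates that reading: update_wiki is a stand-in, the edit counts are made up, and the largest-first ordering and pool size are assumptions, not the planned design.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_EDITS = 100_000  # eligibility threshold stated in the proposal

def update_wiki(name):
    """Stand-in for recomputing one wiki's statistics (hypothetical)."""
    return f"{name}: updated"

def update_all(edit_counts, max_workers=4):
    """Update eligible wikis in parallel with a bounded number of workers."""
    eligible = [wiki for wiki, edits in edit_counts.items() if edits > MIN_EDITS]
    # Assumed policy: start the largest wikis first, since they take longest.
    ordered = sorted(eligible, key=edit_counts.get, reverse=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as its input.
        return list(pool.map(update_wiki, ordered))

# Made-up edit counts, for illustration only.
edit_counts = {"frwiktionary": 20_000_000, "tiny_wiki": 5_000, "frwiki": 125_000_000}
statuses = update_all(edit_counts, max_workers=2)
print(statuses)
```

Capping max_workers is what keeps the server from being overloaded; the returned statuses are the kind of information a tracking page could display.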

A new global home page will be created to present all the wikis by family (Wikipedia, Wiktionary, etc.). A new home page for each wiki will show recent global statistics and graphs of the wiki's historical evolution.

Internationalization will make it possible to choose the interface language; English and French versions will be available at first. Other languages will be added when translation files are available.
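
A message catalogue with a fallback language is one common way to support this. The Python sketch below is illustrative only: the MESSAGES layout, the translate function and the key names are hypothetical, and the French labels are either column names quoted in this proposal or plain translations.

```python
# Hypothetical catalogue: one dictionary of interface texts per language.
MESSAGES = {
    "en": {"edits": "Edits", "duration": "Duration", "days": "Days"},
    "fr": {"edits": "Modifications", "duration": "Durée", "days": "Jours"},
}

def translate(key, lang, fallback="en"):
    """Return the text for key in lang, else in the fallback language, else the key."""
    for code in (lang, fallback):
        text = MESSAGES.get(code, {}).get(key)
        if text is not None:
            return text
    return key

print(translate("duration", "fr"))
print(translate("days", "de"))
```

With this layout, adding a language means supplying one more dictionary, and untranslated keys fall back to English instead of breaking the page.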

The project will be an opportunity to review the code, most of which was developed piecemeal, and to document the calculated statistics. The statistical calculations will be restructured and some new ones added. Improvements to the graphs are planned, perhaps using SVG and JavaScript to replace the current images.

Project goals

Main goals:

  1. Turn the Wikiscan site, which currently supports only the French Wikipedia, into a multi-wiki site capable of supporting many other wikis.
  2. Set up a multi-language interface and produce an English version to facilitate translation into other languages.
  3. Improve and optimize the site, in particular by adding global statistics for each wiki.

The current goal for the multi-wiki site is to support at least all public Wikimedia wikis totalling over 100,000 edits, which currently represents 336 wikis.

The source code will be available under a free license, and it will be technically possible to make it work with the database of any wiki using a recent version of MediaWiki.

Project plan

Activities

Main steps of the project:

  1. Multi-wiki transformation:
    • Allow the site to work directly with the Wikimedia Labs database rather than a local replica. This will allow the site to operate easily with any Wikimedia wiki
    • Transform the site into a multi-wiki site with a subdomain and a dedicated database for each wiki
    • Develop tools for daily statistics updates for each wiki (workers, tracking page, configuration) and perform specific optimizations for small wikis
    • Create a new global home page, and a home page for each wiki that displays global statistics
  2. Internationalization:
    • Adapt the interface for multi-language support
    • Translate the interface into English
  3. Community report (after 3 months maximum):
    • Once the Wikiscan site for each wiki is up and running, it will be announced on several large projects to gather initial feedback
  4. Improvements and enhancements:
    • View the overall ranking of users for each year and each month
    • Add page views to the statistics by date
    • Allow users to use a list from a wiki page instead of a category
    • Add new graphs on the evolution of active users based on the number of months of participation
    • Improve the display of the most active articles over the last 24 hours
    • Document the various statistics computed by the site
    • Improve the interface and the graphs
    • Restructure and add statistics
    • Various fixes and optimizations

Budget

The budget will pay for programming hours as an independent developer; nothing is requested for project management or administration. According to an online newspaper [8], the average rate of a freelance PHP developer in France is 327 euros per day (40.9 euros gross per hour). As this project represents several weeks of work, I am asking for 30 euros per hour (this includes the 23% professional charges payable in France).

Description | Hours | Total in EUR (30 EUR/h)

Multi-wiki site transformation
1 - Allow the site to calculate statistics directly with the remote Wikimedia Labs database and perform the necessary optimizations | 15 | 450
2 - Transform the site into a multi-wiki site with a subdomain and a database for each wiki | 18 | 540
3 - Set up a system of "workers" to update statistics for each wiki | 14 | 420
4 - Create a new tracking page to monitor the update status of the workers for all wikis | 16 | 480
5 - Create a new global home page displaying the list of available wikis with some general statistics and graphs for each wiki | 20 | 600
6 - Create a new home page for each wiki with global statistics updated regularly | 16 | 480
7 - Adapt statistics by date for small wikis that have too few edits per day (show only months and add years, allow more than 24 hours for recent edits) | 6 | 180
Total | 105 h | 3150 €

Internationalization and documentation
8 - Adapt the interface for multi-language support with all texts in a single file | 18 | 540
9 - Translate the interface into English | 8 | 240
10 - French documentation of the various statistics computed by the site, with the possibility of adding other languages by translating a file | 8 | 240
Total | 34 h | 1020 €

Improvements and optimizations
11 - Display the overall ranking of users for each year and each month | 6 | 180
12 - Add page views to the statistics by date; the system must be highly optimized to work with hundreds of wikis | 20 | 600
13 - Allow the use of a list of users contained in a wiki page instead of a category | 2 | 60
14 - Add new graphs on the evolution of active users based on the number of months of participation | 10 | 300
15 - Improve the display of the most active articles for the last 24 hours | 12 | 360
16 - UI and graph improvements | 8 | 240
17 - Restructuring and addition of statistics | 6 | 180
18 - Optimizations and various corrections, code rewriting | 20 | 600
Total | 84 h | 2520 €

Total project | 223 h | 6690 €

Total project: 6690.00 EUR, i.e. 7632.62 USD (at 1 EUR = 1.1409 USD as of 8 April 2016)
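
The budget arithmetic can be checked with a few lines of Python. The hours per task, the hourly rate and the exchange rate are taken from this section; the variable names are only illustrative.

```python
RATE_EUR_PER_HOUR = 30
EUR_TO_USD = 1.1409  # rate quoted for 8 April 2016

# Hours per task, in the order of the budget lines 1 to 18.
hours = {
    "Multi-wiki site transformation": [15, 18, 14, 16, 20, 16, 6],
    "Internationalization and documentation": [18, 8, 8],
    "Improvements and optimizations": [6, 20, 2, 10, 12, 8, 6, 20],
}

subtotal_hours = {section: sum(tasks) for section, tasks in hours.items()}
total_hours = sum(subtotal_hours.values())
total_eur = total_hours * RATE_EUR_PER_HOUR
total_usd = round(total_eur * EUR_TO_USD, 2)

for section, h in subtotal_hours.items():
    print(f"{section}: {h} h, {h * RATE_EUR_PER_HOUR} EUR")
print(f"Total: {total_hours} h, {total_eur} EUR, {total_usd} USD")
```

The section subtotals come to 105 h, 34 h and 84 h, which add up to the 223 h and 6690 EUR requested.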

Community engagement

The site has been running for the French Wikipedia since 2011, where it is commonly used, and this community has already provided feedback on several occasions.

I will approach francophone projects other than Wikipedia (Wiktionary, Wikisource, Wikibooks, Wikiversity, etc.) to better adapt Wikiscan to the particularities of these projects, such as their use of namespaces.

When the sites for each wiki start to become operational and the interface is available in English (at most 3 months after the start), I will contact the largest wikis in other languages to get feedback. I will probably open a page on Meta to centralize discussions about Wikiscan.

Sustainability

The site is designed to operate autonomously; all statistics updates are processed automatically. I will continue to maintain the server and fix bugs, as I have done for the French Wikipedia site since 2011.

Wikimedia wikis with more than 100,000 edits will be added automatically, and their statistics will be activated automatically.

The dedicated server hosting the site is already funded; it belongs to the Wikimédia France association, which ensures good continuity for the project.

Many further developments and improvements are possible: supporting all small wikis; adding new statistics (deleted contributions in particular); computing overall statistics for each user across all wikis; providing custom graphs from the many available statistics; improving the interface and allowing customization (for example, choosing the filters for the blocks displayed on the "grid" page); improving statistics for a list or category of users (calculating the total and average for all these users); exporting the available data in various formats (CSV, JSON...); supporting Translatewiki for translations; developing an API so that external tools can easily use the data (e.g. graphs embedded in wiki pages with the Graph extension); developing a dedicated MediaWiki extension...

Measures of success

Public Wikimedia wikis totalling at least 100,000 edits (currently 336 wikis) will get their own functional xxx.wikiscan.org site, with the statistics described in "Introducing Wikiscan" and the newly developed improvements.

Wikiscan subdomains for the 336 currently supported wikis would be:

Visitor goal: reach at least 300 unique users per month across all new subdomains (thus excluding fr.wikiscan.org and the future wikiscan.org portal). The measurement tool is Google Analytics.

Get involved

Participants

  • Akeron: software developer. I have all the technical knowledge required to carry out this project, I know Wikiscan very well because I designed it entirely myself, and I have been a regular Wikimedia editor since 2006, mainly on French Wikipedia.

Community Notification

This project was presented on some francophone projects and on English Wikipedia: Wiktionary, Wikisource, Wikiversity, English Wikipedia (idea lab).

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal.)

  1. Endorsed Wikiscan is a great tool and I really wish it supports more wikis. Thibaut120094 (talk) 17:25, 8 April 2016 (UTC)
  2. Endorsed JackPotte (talk) 17:41, 8 April 2016 (UTC)
  3. Endorsed Ernest-Mtl (talk) 18:10, 8 April 2016 (EDT)
  4. Endorsed Wikiscan was designed for the French Wikipedia, where I edit the most, and it worked well. I used Wikiscan once a week from 2010 to 2014, because it provided lots of useful infos on users, on articles and so on. I highly regard the product and the software developer. Cantons-de-l'Est (talk) 16:23, 9 April 2016 (UTC)
  5. Endorsed Unsui (talk) 20:18, 9 April 2016 (UTC)
  6. Endorsed --Yann (talk) 15:06, 11 April 2016 (UTC)
  7. Endorsed Lyokoï (talk) 15:15, 11 April 2016 (UTC)
  8. Endorsed Tomthepsg (talk) 17:26, 11 April 2016 (UTC)
  9. Endorsed Pamputt (talk) 15:36, 11 April 2016 (UTC)
  10. Endorsed Otourly (talk) 15:36, 11 April 2016 (UTC)
  11. Endorsed essential & expected (so). --Benoît Prieur (talk) 15:37, 11 April 2016 (UTC)
  12. Endorsed Wuyouyuan (talk) 13:14, 12 April 2016 (UTC)
  13. Support Lionel Scheepmans (French native speaker, sorry for my spelling difficulties) 15:42, 11 April 2016 (UTC)
  14. Endorsed Highly interesting and useful to deal with. Many thanks Kaviraf Kaviraf (talk) 16:18, 11 April 2016 (UTC)
  15. Endorsed Thierry613 (talk) 16:52, 11 April 2016 (UTC)
  16. Endorsed I had the opportunity to see pleasantly run this tool on the french wikipedia and the use that could be done. Other projects can profitably use this tool. Request worth studying with the utmost seriousness Crochet.david (talk) 17:05, 11 April 2016 (UTC)
  17. Endorsed --Zyephyrus (talk) 22:14, 11 April 2016 (UTC)
  18. Endorsed Viticulum (talk) 00:24, 12 April 2016 (UTC)
  19. Endorsed A very good tool that could help a lot of communities of the Wikimedia movement. Pyb (talk) 11:49, 12 April 2016 (UTC)
  20. Endorsed Looks to fill an important gap in information about Wikipedias. Fences and windows (talk) 16:43, 12 April 2016 (UTC)
  21. Endorsed Very helpful tool! Dyolf77 (talk) 23:04, 12 April 2016 (UTC)
  22. Endorsed To understand the community and the readers, nothing is more important than clear data, which is provided by this tool. Also, the pageviews stats are the best quantitative data we have to measure the effective sharing of knowledge. We need more of this. Plyd (talk) 09:28, 13 April 2016 (UTC)
  23. Endorsed Because this project need acknowledgement of progression. Noé (talk) 21:26, 13 April 2016 (UTC)
  24. Endorsed high time to recognise this effort properly. extremly valuable to judge new groups editing. a license fitting to mediawiki would be key, e.g. GPL? --ThurnerRupert 10:13, 14 April 2016 (UTC)
  25. Endorsed Akeron, I am Dan Andreescu (milimetric) of the WMF Analytics team. I think what you're doing here is very valuable for the wiki movement, and I support your proposal. We are in the process of overhauling the kind of data that's available to query, so that hopefully building projects like Wikiscan in the future will become much easier. We do not have yet all the technical details, but we're calling the project Wikistats 2.0 and you can come to #wikimedia-analytics and talk to me (milimetric) or Joseph Allemandou (joal) about it anytime. We would love your input on what would make a more useful infrastructure for you. Also I am curious what kind of technology you're using, it seems like it would help us with our efforts. Milimetric (talk) 11:24, 27 May 2016 (UTC)
    Hi, thank you for your support. The technology is basic : PHP/Mysql and some Memcached, this is working fine actually and will work with multi-wikis but in the future I would like to switch to a NoSql solution. Wikiscan needs are mainly low level data from revision and logging tables, something that would be useful is a single table (or some kind of data flow) which contains all this data (ideally with deleted edits too). Like the recentchanges table but with whole project history, other things interesting in this table are old_len/new_len to calculate diff and new page flag. There is also some needs for pageviews data, like retrieving all data for one project for 1 hour or 24 hours. Akeron (talk) 15:17, 27 May 2016 (UTC)
  26. Endorsed I'm suitably impressed by the coverage and depth of Wikiscan, as well as its mission to provide near-instantaneous stats. I also like the flexibility of the interface, where a user can easily zoom-in on one year, or month, or even date, where also most tables are sortable, even by a combination of criteria. Well done! I added a section on the project in Wikistats portal. It seems to me euro 6690 is a modest amount, given the scope of the project. Even more when the inevitable rise in complexity (coordination-wise) for a multi-lingual site with many more stakeholders is taken into account. Having said that, the quote is so modest that I wouldn't feel bad if only 70%-80% of the targets were reached. One comment on the scope of Wikiscan: the threshold of 100,000 edits seems rather high to me. For people who care about emerging communities a much lower threshold would be useful. I assume that once Wikiscan can handle 336 wikis, inclusion of even more languages will be a minor effort configuration-wise. Erik Zachte (talk) 08:55, 29 May 2016 (UTC)
    Hi, thank you very much. Wikiscan was initially built and optimized for the size of French Wikipedia, supporting the 100k+ wikis (which represents 99.5% of total edits) is the first step, it shouldn't be hard to support all the small wikis with some dedicated optimization so the current server can handle it. Akeron (talk) 15:26, 30 May 2016 (UTC)