From Meta, a Wikimedia project coordination wiki

Following Updates[edit]

In the coming days we will make major updates to the WikiXRay tool, which will be reflected on this page. Among the most important features that will be released:

  • A completely renewed graphical engine (object-oriented design). It can produce many different types of 2D and also 3D graphics with GNU R and Gnuplot.
  • The graphics section will be updated with interesting new results concerning inequality in authors' contributions, and a 3D temporal analysis of the most active contributors.
  • A new Python parser for processing the Wikipedia XML dumps, specially designed for research purposes, will be released in this update. It processes the XML dumps and creates two .sql files for filling all the available fields in the revision and page tables (according to the MediaWiki 1.10 definition in tables.sql).
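The parser described in the last point can be sketched roughly as follows. This is an illustrative SAX-based sketch, not WikiXRay's actual implementation: the handler, the simplified sample XML, and the three-column subset of the page table are assumptions for the example (real dump schemas and the full MediaWiki 1.10 tables.sql define many more fields).

```python
import xml.sax

# Sketch: turn <page> elements from a Wikipedia XML dump into INSERT
# statements for the `page` table. Only page_id, page_namespace and
# page_title are filled here; a real parser would cover all columns.
class PageHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_page = False
        self.buf = []
        self.page = {}
        self.inserts = []

    def startElement(self, name, attrs):
        if name == "page":
            self.in_page = True
            self.page = {}
        self.buf = []

    def characters(self, content):
        self.buf.append(content)

    def endElement(self, name):
        text = "".join(self.buf).strip()
        # Keep only the first occurrence of each field within a page,
        # so e.g. a revision <id> does not clobber the page <id>.
        if self.in_page and name in ("id", "ns", "title") and name not in self.page:
            self.page[name] = text
        if name == "page":
            self.in_page = False
            title = self.page.get("title", "").replace("'", "''")  # escape quotes
            self.inserts.append(
                "INSERT INTO page (page_id, page_namespace, page_title) "
                f"VALUES ({self.page.get('id', 0)}, {self.page.get('ns', 0)}, '{title}');"
            )

# Simplified stand-in for a stub dump (real exports carry more elements).
sample = """<mediawiki>
  <page><title>Foo</title><ns>0</ns><id>1</id></page>
  <page><title>Bar</title><ns>0</ns><id>2</id></page>
</mediawiki>"""

handler = PageHandler()
xml.sax.parseString(sample.encode("utf-8"), handler)
print("\n".join(handler.inserts))
```

A streaming SAX handler like this keeps memory use flat, which matters when the decompressed dump runs to hundreds of gigabytes.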

--GlimmerPhoenix 21:56, 4 July 2007 (UTC)[reply]


I've used the stub dump version for the English Wikipedia. It contains all the information relevant to what I'm currently working on, though its text table (which contains the text of every article) is empty. That way I save a lot of space for other language versions. The complete decompressed 7z dump of the English Wikipedia (text table included) extends far beyond 650 GB.

Disk space is the most critical requirement. I currently have 4 SATA-II drives (very fast, 16 MB cache) configured in RAID-5, for a total of 1.2 TB. The CPU is an Intel Core 2 Duo 6600 and the RAM is 1 GB of PC-5300.

Regarding article size, the X axis represents log10(article size in bytes). Each bar's width represents a size interval (for example, 3 indicates 1,000 bytes, nearly 1 KB), and its height represents the frequency of appearance. So you can infer that in the standard articles population, the most frequent size is around 10^3.3 ≈ 1,995 bytes. GlimmerPhoenix 16:31, 27 January 2007 (UTC)[reply]
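The bucketing described above can be sketched in a few lines. The article sizes below are made-up stand-ins, and the 0.1-wide buckets are an assumption for the example; only the log10 transformation itself comes from the explanation.

```python
import math
from collections import Counter

# Hypothetical article sizes in bytes (real data would come from the dump).
sizes = [120, 950, 1200, 1800, 2100, 2300, 45000, 800, 1500]

def log_bucket(size_bytes, width=0.1):
    """Return the left edge of the log10 bucket this size falls into."""
    return round(math.floor(math.log10(size_bytes) / width) * width, 1)

# Bar height = number of articles whose log10(size) falls in the bucket.
hist = Counter(log_bucket(s) for s in sizes)
for edge in sorted(hist):
    # e.g. the bucket at 3.3 covers 10**3.3 ≈ 1995 B up to 10**3.4 ≈ 2512 B
    print(f"{edge:4.1f} (≈{10**edge:8.0f} B): {'#' * hist[edge]}")
```

Reading the peak bucket's left edge back through 10**edge recovers the "most frequent size" figure quoted in the post.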


This is a neat page, but I have a few questions. First and foremost... what kind of system did you load the English Wikipedia dump on to? How much hard drive space, RAM, etc.? How large is the enwiki database after it is decompressed?

Also, could you provide more information as to what the article size histograms mean? I don't really understand how the values on the X axis relate to the article sizes...

Anyways, it's pretty cool what you've done. Thanks! ~MDD4696 02:55, 23 January 2007 (UTC)[reply]

Quality assessments[edit]

Hi, this project looks very interesting! I thought you might be interested in another couple of valuable statistics: quality and importance. I mention them because I don't see any use of those parameters in your work so far, but that may be because they are only being used widely on en at the moment (though ru and fr are looking at it). On the English Wikipedia we have a system whereby a WikiProject carries out assessments of the quality and importance of articles within its subject area. These statistics are then collected by bot, and can be used by us at the Wikipedia 1.0 team for gathering articles. So far over 250,000 articles have been assessed for quality, and over 60,000 for importance, across a wide range of subject areas, so you should be able to get statistically useful data. See the main listing for more information. FYI, the quality assessments are fairly standard across all projects, but importance is more relative, being based on importance within the project. Good luck, Walkerma 03:29, 29 January 2007 (UTC)[reply]


Quality and relative importance are quite interesting. At the very least, you can compare distributions and history for different collections (such as pages meeting a certain quality criterion). Is there any marker for this in the English database? Or a db dump of some kind? (to avoid mining HTML pages or wiki text, if possible).

Other possible sources of metrics for considering "interesting" collections could be those terms most frequently visited, or those present in more than x languages, but for now we have not figured out how to get that kind of information easily.

Jgb 23:25, 31 January 2007 (UTC)[reply]

Request for inclusion[edit]

Is there a way to request some other local wikis (like Simple English or Farsi wikipedias) to be graphed by this project? Huji 15:45, 5 June 2007 (UTC)[reply]

Sure, I'll be very glad to include additional wikis in this analysis. Just tell me where we can get the dumps for those wikis.--GlimmerPhoenix 21:44, 4 July 2007 (UTC)[reply]


I'm happy to see this project. I created wik2dict a couple of years ago, and now I will be researching wikis, especially in relation to trust. I've mainly used numpy/scipy, and don't have any experience with R, but I'm happy to find ways to cooperate. We're using CC-BY and the GPL for most stuff at our trust metrics project. Guaka 17:59, 31 October 2007 (UTC)[reply]

Project Status?[edit]

Is this project still under active development? I see many of the wikixray wiki pages haven't seen much activity for 18 months.

It is not active on Wikimedia, but José Felipe Ortega Soto recently published (February/March 2009) results using it in his study Wikipedia: A quantitative analysis. HenkvD 18:43, 9 June 2009 (UTC)[reply]

How to download?[edit]

This Project Has Not Released Any Files... ? --Piotrus 18:22, 4 August 2009 (UTC)[reply]

Python get_Connection() Attribute Error[edit]

Has anybody seen this error? I am not able to run the parser and cannot fix it. I am totally new to Python. 16:54, 1 September 2009 (UTC)[reply]
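Without a traceback it is hard to say what caused the original error, but in Python an AttributeError usually means the name being called simply does not exist on the object, often because of a casing mismatch or a version difference. The class and method names below are purely illustrative stand-ins for whatever WikiXRay module raises the error; the diagnostic pattern with dir() is the generic part.

```python
# Illustrative stand-in for a database-access class (not WikiXRay's real code).
class DBAccess:
    def get_connection(self):  # note: lowercase 'c'
        return "connection"

db = DBAccess()

# Calling the method under the wrong name raises AttributeError:
try:
    db.get_Connection()
except AttributeError as e:
    # dir() lists what the object really exposes, which usually reveals
    # the casing or version mismatch behind the error.
    methods = [m for m in dir(db) if not m.startswith("_")]
    print("AttributeError:", e)
    print("Available methods:", methods)
```

Running dir() on the object that raises the error, or checking that the installed WikiXRay version matches the one the documentation describes, is a reasonable first debugging step.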

All download URLs in both the page and discussion are dead[edit]

This project seems to be completely defunct. Other than being of academic and historical interest, it should probably be removed from Wikipedia pages on the topic of working with XML data dumps.