Wikistats/Measuring Article Quality


This article discusses measuring Wikipedia quality in conceptual terms; see also the subpage Operationalisation for wikistats.

Introduction

Currently wikistats measures and compares Wikimedia projects based on hard facts: counts of all kinds. One negative side effect of wikistats is that it encourages some people to compete on article counts, by generating 50 stubs in an evening and boasting about how this will soon 'beat' the next Wikipedia higher up in the article count rankings, or to emphasise personal edit counts as a measure of status within the community.

Objective

Most of the time it is much harder to write one good article than 50 stubs. Would it not be nice if we could compare Wikipedias automatically based on quality indices, next to counts? The once very influential novel Zen and the Art of Motorcycle Maintenance already exposed how difficult it is to define quality, even though everyone has an intuitive, albeit subjective, notion of what quality is. Some aspects of quality are easier to measure automatically than others.

Attainability

So the objective is not to provide the ultimate automated test of quality for Wikimedia projects; that is impossible. It might be interesting though to give it a go and see how useful automatically derived indices can be in determining whether any Wikimedia project advances towards higher average quality, and to analyze differences between Wikipedias in these respects.

The ultimate judge of how useful the results are is the user, either individually, as part of the Wikimedia community, or as part of the general public. So it would be interesting to compare automated results on a small set of articles with manual scores from a sample user group, and possibly feed those manual results back to the scripts to tune parameters.

Any criterion will be imperfect, but the law of large numbers will work in our favour. There may be brilliant articles of fewer than 20 words, but on average such articles count as 'not so impressive'. There may be articles that do not benefit from adding markup and images (perhaps when a poem is presented and annotated), but on average a good ratio of text length to number of section headers is an indication that authors have tried to present the content in an orderly and digestible fashion. Also the presence of categories, images, references, etc. gives an indication of how much effort has gone into the article, again on average only.
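To illustrate, several of these surface indices can be read straight from the wikitext of an article. The sketch below (in Python, indented wiki-style) is only a rough illustration; the patterns and the choice of indices are simplifications for this page, not the rules wikistats actually applies.

 import re
 
 def surface_indices(wikitext):
     """Count a few crude quality indicators in raw wikitext.
     All patterns here are simplifications chosen for illustration only."""
     words = len(re.findall(r"[\w']+", wikitext))
     headers = len(re.findall(r"^==+[^=].*?==+\s*$", wikitext, re.MULTILINE))
     images = len(re.findall(r"\[\[(?:Image|File):", wikitext, re.IGNORECASE))
     categories = len(re.findall(r"\[\[Category:", wikitext, re.IGNORECASE))
     references = len(re.findall(r"<ref[ >]", wikitext, re.IGNORECASE))
     return {
         "words": words,
         "headers per 1000 words": 1000.0 * headers / words if words else 0.0,
         "images": images,
         "categories": categories,
         "references": references,
     }

Note that namespace prefixes such as Image: and Category: differ per language, which is one more face of the interlanguage compatibility issue discussed further down.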

Of course many will feel that any automated results are so unbalanced and biased that they distort rather than contribute to quality awareness. How many of them will be convinced of the usefulness of automated quality assessment will depend on our results.

What is quality?

There is a huge number of aspects that contribute to article quality, all subjective, many contested. Some aspects will be beyond automation for some time, others will contain many subaspects, and some of those will again be easier to measure than others. One can only hope that a partial assessment can serve as an indicator for an unattainable fully automated quality ranking.

To use an analogy here: a site with a high Google ranking does not per se exhibit high quality, but most googlers trust that on average they will get pretty useful and well organised suggestions from Google.

In very broad terms most people will agree that the assessment of article quality includes:

  • Factual correctness
Very ambitious to automate this, beyond the current scope
  • Relevance
Highly subjective, beyond the current scope (although Google rankings might be used as an indicator)
  • Verifiability of content
One aspect of this is the amount and quality of references. As for quality: are they based on books of good standing, and are references to web pages (still) resolvable?
  • Textual structure, readability
Of course the target audience is relevant here. There seem to be automated tests for assessing how well text is structured (at least for some languages). To be researched.
  • Article structure, layout
Most larger articles benefit from section headers and other layout markup. This can be partially measured.
  • Accessibility
An uncategorised article may be more difficult to find.
  • Auditability
Are most recent editors anonymous or registered users? A sketch of such an index follows this list.
  • Adherence to project guidelines, like NPOV
Revert wars might serve as an indication here.
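
To give one example of how such an aspect could be turned into a number, the sketch below estimates auditability as the share of recent edits made by registered users. The 'anon' field is an assumption about how revision data might be delivered (e.g. when extracted from a dump); it is meant as an illustration only, not as a fixed interface.

 def auditability_index(recent_revisions):
     """Share of the most recent revisions made by registered users.
     recent_revisions is assumed to be a list of dicts with a boolean
     'anon' field; that field name is an illustrative assumption."""
     if not recent_revisions:
         return 0.0
     registered = sum(1 for rev in recent_revisions if not rev.get("anon", False))
     return registered / len(recent_revisions)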

Approach

To start simply, a few indices might be collected per article and averaged per month, and one or more weighted averages of these indices might yield a general quality index to follow through time.

Instead of collecting all indices and aggregating them into just one weighted aggregate number, we could present all indices and one or two aggregations together with the formula used.
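
For instance, the published formula could be as simple as a weighted average over indices that have each been normalised to a 0..1 range. The sketch below, with invented index names and weights, shows what that might look like.

 def aggregate_quality(indices, weights):
     """Weighted average Q = sum(w[i] * x[i]) / sum(w[i]) over indices
     that have already been normalised to the range 0..1. The index
     names and weights used here are invented placeholders."""
     total_weight = sum(weights.values())
     if total_weight == 0:
         return 0.0
     return sum(w * indices.get(name, 0.0) for name, w in weights.items()) / total_weight
 
 # Example with made-up values and weights:
 aggregate_quality({"structure": 0.7, "references": 0.4, "categorised": 1.0},
                   {"structure": 2.0, "references": 3.0, "categorised": 1.0})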

We might test the relevance and validity of the weighting factors by asking a group of people to score 20 or more articles by hand and see whether the calculated figure correlates well, and even optimise the weighting factors based on the manual scores.
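
As a first sanity check one could compute the correlation between the calculated aggregate and the manual scores, and then try a crude brute-force search over the weighting factors. The sketch below assumes the aggregate_quality helper from the previous sketch and is just one of many possible tuning approaches.

 from itertools import product
 
 def pearson(xs, ys):
     """Plain Pearson correlation coefficient."""
     n = len(xs)
     mx, my = sum(xs) / n, sum(ys) / n
     cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     sx = sum((x - mx) ** 2 for x in xs) ** 0.5
     sy = sum((y - my) ** 2 for y in ys) ** 0.5
     return cov / (sx * sy) if sx and sy else 0.0
 
 def tune_weights(article_indices, manual_scores, names, steps=(0.0, 1.0, 2.0, 3.0)):
     """Brute-force search for the weight combination whose aggregate
     correlates best with the manual scores. Purely illustrative."""
     best_weights, best_r = None, -2.0
     for combo in product(steps, repeat=len(names)):
         weights = dict(zip(names, combo))
         if sum(weights.values()) == 0:
             continue
         computed = [aggregate_quality(ind, weights) for ind in article_indices]
         r = pearson(computed, manual_scores)
         if r > best_r:
             best_weights, best_r = weights, r
     return best_weights, best_r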

We might even provide the user with the option to override parameters, so that, for example, whether an article is categorised is ignored and correct spelling is made all-important in a personalised quality assessment.

It would be great to add indices for spelling errors, grammatical errors and text structure, and there are tools for each of these. User Altergo offered to look into those and discuss their applicability. Keep in mind that resource usage is an important concern, at least for now. Large dictionaries for spell checking may tax server memory; also, the English dump is already 800 GB and takes 2 days to process right now. So modules need to be tested for efficiency. Interlanguage compatibility is also a major challenge, not to mention Wikimedia oddities like intermingling different spelling conventions, such as US and British spellings and others. A spelling checker should allow multiple spelling conventions (though possibly subtracting points for mixing them in one article).
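
As a very modest first step that needs no full dictionary, an article could be flagged when it mixes spelling variants. The word pairs below are a tiny hypothetical sample; a real checker would need proper per-language word lists.

 # A tiny hypothetical sample of US/British variant pairs; a real checker
 # would need proper per-language word lists.
 VARIANT_PAIRS = [("color", "colour"), ("center", "centre"), ("organize", "organise")]
 
 def mixed_spelling_penalty(text):
     """Count how many variant pairs occur in both spellings within one
     article, as a crude signal that conventions are being mixed."""
     words = set(word.lower().strip(".,;:()'\"") for word in text.split())
     return sum(1 for us, gb in VARIANT_PAIRS if us in words and gb in words)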

We will have to see how difficult it will be to derive meaningful results that are recognized as such by the community, but it certainly will be an interesting and worthy challenge. Erik Zachte 23:28, 5 August 2006 (UTC)