One related challenge that isn't mentioned is filtering out reversions (including partial reverts) that result in significant amounts of added text. That threw off the Public Policy Initiative student contribution tracking for the one student (Kevin Gorman) who started doing a lot of recent changes patrol; he racked up a huge number of bytes contributed, mostly by reverting content blankings.

I'd also avoid characterizing other kinds of contribution besides text addition as "noise".

There are many other categories besides added text that would be really useful. The most prominent one that comes to mind would be number of inline citations added. Adding a good source (which may have all its content inside a template) can be a more valuable addition than kilobytes of uncited new text. Having that number along with text added would give a much more interesting picture: refs/kilobyte added, and how that changes over time, would be a great thing to know. --Ragesoss 19:56, 24 June 2011 (UTC)
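
As a rough illustration of the refs/kilobyte idea above, here is a minimal sketch in Python. It assumes you already have the wikitext of a revision and of its parent as strings; the function names `count_refs` and `refs_per_kilobyte` are made up for this example. It counts opening `<ref>` tags (including self-closing named reuses) and uses the net byte change as a crude stand-in for "text added", so it will miss citations added through other mechanisms and will undercount text when an edit both adds and removes content.

```python
import re

# Matches an opening <ref ...> or self-closing <ref ... /> tag.
# It does not match </ref> or <references />.
REF_TAG = re.compile(r"<ref(?=[\s>/])", re.IGNORECASE)


def count_refs(wikitext):
    """Count inline citation tags in a piece of wikitext."""
    return len(REF_TAG.findall(wikitext))


def refs_per_kilobyte(old_text, new_text):
    """Refs added per kilobyte of net text added between two revisions.

    Returns None when the edit did not add any bytes on net,
    since the ratio is undefined in that case.
    """
    refs_added = max(count_refs(new_text) - count_refs(old_text), 0)
    bytes_added = len(new_text.encode("utf-8")) - len(old_text.encode("utf-8"))
    if bytes_added <= 0:
        return None
    return refs_added / (bytes_added / 1024)
```

Tracking this ratio per editor over successive edits would give the "how that changes over time" view; a real tool would probably want a proper diff rather than net size change.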

Thanks Sage! Those are great points, both about reverts that undo blanking and about other text-processing priorities like citations. It looks like we're going to build a generalized diff text-processing tool to tackle this, along with other questions such as counting uses of templates that are substituted instead of transcluded. If you have ideas, you should definitely add them to the list at Research:ABD - All But Diffs. Steven Walling at work 23:49, 24 June 2011 (UTC)
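
One possible way to keep blanking reverts out of a bytes-added tally is sketched below in Python. This is only an illustration under stated assumptions, not the tool described above: it assumes the page history is available as a chronological list of `(user, text)` pairs, and the function name `bytes_added_excluding_reverts` and the `window` parameter are hypothetical. A revision whose checksum matches one of the last few revisions is treated as an exact revert and earns no credit. Partial reverts do not produce a matching checksum, so they would still need real diff processing.

```python
import hashlib
from collections import defaultdict


def bytes_added_excluding_reverts(revisions, window=15):
    """Sum positive byte deltas per user, skipping exact reverts.

    revisions: list of (user, text) tuples in chronological order.
    window: how many preceding revisions to check for a matching checksum
            (an arbitrary value chosen for this sketch).
    """
    totals = defaultdict(int)
    recent = []      # checksums of the most recent revisions
    prev_size = 0    # size in bytes of the previous revision

    for user, text in revisions:
        data = text.encode("utf-8")
        digest = hashlib.sha1(data).hexdigest()
        delta = len(data) - prev_size

        # Identical content to a recent revision means this edit restored
        # an earlier state, so its added bytes are not a new contribution.
        is_revert = digest in recent

        if delta > 0 and not is_revert:
            totals[user] += delta

        recent.append(digest)
        recent = recent[-window:]
        prev_size = len(data)

    return dict(totals)
```

For example, if one editor writes a paragraph, a vandal blanks it, and a patroller restores it, only the original author is credited with the paragraph's bytes; the patroller's restore matches the earlier checksum and is skipped.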