User:Jah/histfilter

From Meta, a Wikimedia project coordination wiki

Overview

Histfilter is a CGI script that displays filtered page histories.

This has only been tested under Linux and with Firefox 1.5.

Available filters:

  • Hide vandal edits that have already been reverted.
  • Combine successive edits by the same user.
  • Include or exclude edits by specific users or user groups.
  • Show section histories (only the versions that affect a given section).
  • Select some text from any version, and the script shows you in which version that text was inserted, altered or deleted.
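As a rough illustration of the first two filters, here is a minimal Python sketch. It is not taken from histfilter (which is a Perl script); the `Rev` record, the function names, and the rule that a revert is a revision whose text exactly equals an earlier one are all assumptions made for this example.

```python
import hashlib
from typing import NamedTuple


class Rev(NamedTuple):
    # Hypothetical revision record; the real data layout is not documented here.
    rev_id: int
    user: str
    text: str


def drop_reverted(revs: list[Rev]) -> list[Rev]:
    """Hide edits that a later revision undid by restoring an earlier text.

    Assumption: a revert is a revision whose text is byte-identical to an
    earlier revision. The edits in between and the revert itself are dropped.
    """
    kept: list[tuple[str, Rev]] = []   # (text digest, revision), oldest first
    position: dict[str, int] = {}      # text digest -> index in `kept`
    for rev in revs:
        digest = hashlib.sha1(rev.text.encode("utf-8")).hexdigest()
        if digest in position:
            cut = position[digest] + 1
            for old_digest, _ in kept[cut:]:
                del position[old_digest]
            del kept[cut:]             # forget everything that was undone
            continue                   # the revert itself is not shown either
        position[digest] = len(kept)
        kept.append((digest, rev))
    return [rev for _, rev in kept]


def collapse_same_user(revs: list[Rev]) -> list[Rev]:
    """Merge each run of successive edits by one user into its last revision."""
    out: list[Rev] = []
    for rev in revs:
        if out and out[-1].user == rev.user:
            out[-1] = rev              # keep only the newest revision of the run
        else:
            out.append(rev)
    return out


history = [
    Rev(1, "alice", "good"),
    Rev(2, "vandal", "bad"),
    Rev(3, "bob", "good"),             # exact revert of revision 2
    Rev(4, "bob", "good, more"),
    Rev(5, "bob", "good, more!"),
    Rev(6, "carol", "rewritten"),
]
filtered = collapse_same_user(drop_reverted(history))
```

Applying the revert filter first matters in this sketch: otherwise the revert by "bob" would be merged with his later edits and the vandal edit would stay visible.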

The filters can be set in a form (screenshot). You do not have to use the form, though: you can also edit your monobook.js file (as in de:Benutzer:Jah/monobook.js) so that links to the program are generated on article pages, including section history links.

There is one disadvantage, however: before you can use the program, you must preprocess the database dump, which takes some time (currently about one day for the German Wikipedia dump).

Installation

All files are saved here as normal wiki pages. Please copy their wiki source code and save each file under the same name as its page. Check the page histories before you run any script.

/recompress (Perl script that extracts the relevant information from the dump; save it anywhere and make it executable)

Suppose Apache is installed under /var/www.

Save these files in /var/www/cgi-bin and change their permissions:

and these files in /var/www/html:

Then edit recompress and histfilter, and set $datadir to a location with enough free disk space (currently 10 GB is enough for the German Wikipedia).

Get the dump from http://dumps.wikimedia.org (the one with pages-meta-history in its file name).

If you have saved the .7z dump and recompress in the current directory, run the following command (or a similar one for other projects):

7z e -so dewiki-20060628-pages-meta-history.xml.7z | ./recompress de wikipedia

Several files are created in $datadir. No progress indicator is shown, but the files should grow steadily. For instance, with dewiki-20060628-pages-meta-history.xml.7z (1916220129 bytes) as input, the .rev file ends up at 1406223581 bytes.
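Since recompress prints no progress, one way to confirm that it is still working is to compare file sizes in $datadir at two points in time. The following Python sketch is a separate helper, not part of histfilter; the function names and the directory path in the usage note are assumptions.

```python
import os


def snapshot(datadir: str) -> dict[str, int]:
    """Map every regular file in `datadir` to its current size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(datadir)
        if entry.is_file()
    }


def grown(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Bytes gained per file between two snapshots.

    Files that are new in `after` count with their full size; files that
    did not grow are left out.
    """
    return {
        name: size - before.get(name, 0)
        for name, size in after.items()
        if size > before.get(name, 0)
    }
```

For example, take `a = snapshot("/path/to/datadir")`, wait a minute, take `b = snapshot("/path/to/datadir")`, and print `grown(a, b)`. An empty result over several minutes suggests that the job has stalled or finished.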

Usage

You can call histfilter directly: http://localhost/cgi-bin/histfilter

Alternatively, add the contents of de:Benutzer:Jah/monobook.js to your monobook.js page. A new link "HF" then appears next to the normal page history link and next to each section edit link. If you select some text, the link at the top of the page changes so that it activates the text filter. To get the normal HF link back, click once or twice on the article page without selecting anything.
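To make the idea of the text filter concrete, here is a deliberately simplified Python sketch. It only reports where an exact substring appears or disappears; the actual filter is based on de:Wikipedia:Hauptautoren and also reports alterations, which this sketch does not attempt. The function name and the (revision id, text) input format are assumptions.

```python
def trace_text(revisions, needle):
    """Report the revisions in which `needle` appears or disappears.

    `revisions` is an iterable of (rev_id, text) pairs, oldest first.
    Only exact substring matches are considered.
    """
    events = []
    present = False
    for rev_id, text in revisions:
        now = needle in text
        if now and not present:
            events.append((rev_id, "inserted"))
        elif present and not now:
            events.append((rev_id, "removed"))
        present = now
    return events


revisions = [
    (10, "Berlin is a city."),
    (11, "Berlin is the capital of Germany."),
    (12, "Berlin is a large city."),
    (13, "Berlin is the capital of Germany and a large city."),
]
events = trace_text(revisions, "capital of Germany")
```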

There are two views, the form view and the table view. You can toggle between them by clicking the "form" or "table" link next to the page name in the first line (on a green background).

The form should be self-explanatory. In the table view #S means "number of sections", and it is highlighted when the section structure changes. If you place the mouse pointer on such a highlighted field, a description of the changes pops up. ΔL means "length difference" (characters).
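The two table columns described above can be approximated as follows. This Python sketch is an illustration only: the heading pattern, the choice not to count the lead section, and the names `section_titles` and `table_row` are assumptions, and histfilter itself may compute these values differently.

```python
import re

# A wikitext heading line such as "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def section_titles(wikitext: str) -> list[str]:
    """Return the heading titles of a revision, in document order."""
    return [match.group(2) for match in HEADING.finditer(wikitext)]


def table_row(old: str, new: str) -> dict:
    """Compute #S, ΔL and a description of section changes for one revision."""
    old_titles = section_titles(old)
    new_titles = section_titles(new)
    return {
        "S": len(new_titles),                      # number of sections
        "dL": len(new) - len(old),                 # length difference in characters
        "structure_changed": old_titles != new_titles,
        "added": [t for t in new_titles if t not in old_titles],
        "removed": [t for t in old_titles if t not in new_titles],
    }


old = "Intro\n== A ==\ntext\n"
new = "Intro\n== A ==\ntext more\n== B ==\nx\n"
row = table_row(old, new)
```

The `added` and `removed` lists correspond to the kind of description that could be shown in the pop-up for a highlighted #S field.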

If you don't like the colours you can change them in the CSS file.

Technical remarks

The text filter is based on de:Wikipedia:Hauptautoren. One day for preprocessing 2 GB may sound incredibly slow, but 7zip inflates the dump to approximately 200 GB, so the average processing speed is about 2 MB per second, or 5 articles per second (in addition to skipping non-article pages). Not too bad for a Perl script, I think. It uses an old trick: it does not process the complete text of each revision, only the sections that have changed.
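The trick of looking only at changed sections can be sketched in Python as follows. This is not the recompress code; the heading pattern and the function names are assumptions, and a section counts as unchanged here only if its exact text already occurs in the previous revision.

```python
import re

# Any line that looks like a wikitext heading, e.g. "== History ==".
HEADING_LINE = re.compile(r"^={1,6}.*={1,6}[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str) -> list[str]:
    """Split a revision into chunks that each start at a heading.

    The first chunk is the lead text before the first heading.
    """
    starts = [0] + [
        m.start() for m in HEADING_LINE.finditer(wikitext) if m.start() > 0
    ]
    ends = starts[1:] + [len(wikitext)]
    return [wikitext[a:b] for a, b in zip(starts, ends)]


def changed_sections(old: str, new: str) -> list[str]:
    """Return the sections of `new` whose exact text does not occur in `old`."""
    known = set(split_sections(old))
    return [section for section in split_sections(new) if section not in known]


old = "Intro\n== A ==\na\n== B ==\nb\n"
new = "Intro\n== A ==\na changed\n== B ==\nb\n"
todo = changed_sections(old, new)
```

Because most edits touch a single section, only a small part of each revision needs further processing, which is where the speed-up comes from.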