User:LA2/Extraktor

From Meta, a Wikimedia project coordination wiki

Extraktor is a Perl script that parses a Wikipedia pages-articles.xml dump (the archive most mirror sites will probably want), which can be downloaded from http://download.wikimedia.org/, and outputs an index of all template call parameters. The idea was first suggested by LA2 in August 2005 on talk:Wikidata as "a minimalist approach", and the script was written one year later.

History

Usage

For example, the article en:Ophir, Alaska contains this text:

 {{otherusesof|Ophir}}
 '''Ophir''' is an unincorporated area located
 at {{coor dms|63|08|41|N|156|31|10|W|}}

Copy the source code below and save it as extraktor.pl, then run it on the uncompressed XML dump like this:

perl extraktor.pl <enwiki-20060816-pages-articles.xml >enwiki.parameters

For the page above, this script should produce the following output:

Otherusesof|Ophir, Alaska|1|1|Ophir
Coor_dms|Ophir, Alaska|2|1|63
Coor_dms|Ophir, Alaska|2|2|08
Coor_dms|Ophir, Alaska|2|3|41
Coor_dms|Ophir, Alaska|2|4|N
Coor_dms|Ophir, Alaska|2|5|156
Coor_dms|Ophir, Alaska|2|6|31
Coor_dms|Ophir, Alaska|2|7|10
Coor_dms|Ophir, Alaska|2|8|W
Coor_dms|Ophir, Alaska|2|9|

In this simple case there are only positional parameters without names, and no advanced syntax. The output is line-oriented, with the vertical bar (|) as the field separator. Each line has either three or five fields:

  1. The first field is the name of the called template.
  2. The second field is the name of the page from where the template was called.
  3. The third field is a sequence number of this call within the page.
  4. The fourth field is the name (where applicable) or positional number of a parameter in this call.
  5. The fifth field is the value of this parameter in this call.

For template calls without parameters, e.g. {{stub}}, fields 4 and 5 are left out.
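
As a rough illustration of how this format could be read back, the sketch below splits each line into its fields and groups the parameters by template call. The helper name parse_index_line and the sample lines for "Example page" are assumptions made up for this example; they are not part of Extraktor.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Extraktor): split one index line
# into its three or five fields.
sub parse_index_line {
    my ($line) = @_;
    chomp $line;
    # An explicit limit of 5 keeps a trailing empty value ("...|image|").
    my ($template, $page, $seqno, $param, $value) = split /\|/, $line, 5;
    return {
        template => $template,
        page     => $page,
        seqno    => $seqno,
        param    => $param,    # undef for calls without parameters
        value    => $value,    # undef for calls without parameters
    };
}

# Made-up sample lines in the format described above.
my @lines = (
    "Otherusesof|Ophir, Alaska|1|1|Ophir",
    "Infobox|Example page|1|name|Example",
    "Infobox|Example page|1|image|",
    "Stub|Example page|2",
);

# Group parameters by page and call number.
my %calls;
for my $line (@lines) {
    my $rec = parse_index_line($line);
    my $key = "$rec->{page}#$rec->{seqno}";
    $calls{$key}{template} = $rec->{template};
    $calls{$key}{params}{ $rec->{param} } = $rec->{value}
        if defined $rec->{param};
}

for my $key (sort keys %calls) {
    my $params = $calls{$key}{params} || {};
    printf "%s (%s): %d parameter(s)\n",
        $key, $calls{$key}{template}, scalar keys %$params;
}
```

Splitting with an explicit limit is a deliberate choice here: without it, Perl's split would silently drop an empty fifth field, and a parameter that was passed with an empty value would look like a call without parameters.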

Disclaimer

This simple script doesn't really parse wiki markup. It only applies some ad-hoc regular expression substitutions. (Note: this is also what the actual MediaWiki "parser" does!) For practical reasons, some difficult syntax such as <math> is simply replaced with the placeholder $math. These placeholders use Perl-like syntax: $PAGENAME for a magic word, $(14) for a template parameter, [[link $! linktext]] for a vertical bar inside a link, and @(Taxobox:2) for a template call that has already been extracted (here the call to Taxobox with sequence number 2).
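
To make the placeholder idea concrete, here is a small stand-alone sketch that applies four substitutions of this kind to a made-up sample string. The function name mask_markup is hypothetical, and the patterns are simplified and written for plain wikitext (with literal < and >) rather than for the escaped text found in the XML dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace awkward markup with Perl-like placeholders.
sub mask_markup {
    my ($text) = @_;
    # <math>...</math> becomes $math (non-greedy match)
    $text =~ s/<math>.*?<\/math>/\$math/g;
    # [[link|linktext]] becomes [[link$!linktext]], so the bar is no
    # longer mistaken for a template parameter separator
    $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
    # {{{14}}} becomes $(14)
    $text =~ s/\{\{\{([^{}]*)\}\}\}/\$($1)/g;
    # {{PAGENAME}} becomes $PAGENAME
    $text =~ s/\{\{(PAGENAME)\}\}/\$$1/g;
    return $text;
}

my $sample = 'See [[Alaska|the state]], <math>x^2</math>, {{{14}}} and {{PAGENAME}}.';
print mask_markup($sample), "\n";
```

Once the vertical bars and triple braces are out of the way, whatever remains between {{ and }} can be split on | without tripping over links or parameter defaults.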

This is a quick hack, not at all "a piece of German engineering".

Source code

#!/usr/bin/perl -w 

use utf8;
# use encoding 'utf8';
use locale;
use POSIX;
use strict;
# setlocale(LC_CTYPE, "UTF-8");

# Treat <ref> tags as a kind of template call
sub reftag
{
    my ($pagename, $seqno, $name, $value) = @_;
    if (defined($name)) {
        print "<ref>|$pagename|$seqno|name|$name\n";
    }
    if (defined($value)) {
        # escape any remaining vertical bars
        $value =~ s/(http:[^ ]*)\|/$1%7c/g;
        $value =~ s/\|/\$!/g;
        print "<ref>|$pagename|$seqno|1|$value\n";
    }
    return "\@(<ref>:$seqno)";
}

# Parse arguments of one template call
sub template
{
    my ($tempname, $pagename, $seqno, $args) = @_;

    # {{foo bar}} calls Template:Foo_bar with upper case F.
    # Might not work for non-ASCII characters.
    $tempname = ucfirst($tempname);
    $tempname =~ s/ /_/g;

    if  (defined ($args)) {
        my ($field, $value);
        my $argc = 0;
        $args =~ s/^\|\s*(.*?)\s*$/$1/;
        # limit -1 keeps trailing empty parameters, as in {{coor dms|...|W|}}
        foreach my $arg (split(/\s*\|\s*/, $args, -1)) {
            $argc++;
            if ($arg =~ /^([^=]*?)\s*=\s*(.*)$/) {
                $field = $1;
                $value = $2;
            } else {
                $field = $argc;
                $value = $arg;
            }
            print "$tempname|$pagename|$seqno|$field|$value\n";
        }
    } else {
        # Template call without parameters. Only print three fields
        print "$tempname|$pagename|$seqno\n";
    }
    return "\@($tempname:$seqno)";
}

# This is like sub template() but for magic keywords RFC and ISBN that
# are followed by just one argument.
sub magicword
{
    my ($tempname, $pagename, $seqno, $arg) = @_;
    print "$tempname|$pagename|$seqno|1|$arg\n";
    return "$tempname $arg";
}

# Parse Wikipedia pages-articles XML dump
my ($title, $text, $append) = ("", "", 0);
while (<>) {
    if (/\<title\>(.*)\<\/title\>/) {
        $title = $1;
        $text = "";
        next;
    }
    $append = 1 if /\<text/;
    $text .= $_ if $append;
    if (/\<\/text\>/) {
        my $i;
        $append = 0;
        $text =~ s/\n/ /g;

        $text =~ s/.*\<text[^\>]*\>(.*)\<\/text\>.*/$1/;

        # Perform various substitutions to get rid of troublesome
        # wiki markup.  In its place, leave $something

        # silently drop HTML comments
        # (in the XML dump, < and > appear as &lt; and &gt;)
        $text =~ s/&lt;!--.*?--&gt;//g;

        # ignore nowiki, non-greedy match, leave $nowiki
        $text =~ s/&lt;nowiki&gt;.*?&lt;\/nowiki&gt;/\$nowiki/g;

        # ignore math, non-greedy match, leave $math
        $text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;

        # wiki link with alternative text, leave $!
        # multiple passes handle image thumbnails
        for ($i = 0; $i < 5; $i++) {
            $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
        }

        # These are not real template calls, leave $pagename
        # (literal braces are escaped; newer Perl versions reject bare { in patterns)
        $text =~ s/\{\{(CURRENT(DAY|DOW|MONTH|TIME(STAMP)?|VERSION|WEEK|YEAR)(ABBREV|NAME(GEN)?)?|(ARTICLE|NAME|SUBJECT|TALK)SPACE|NUMBEROF(ADMINS|ARTICLES|FILES|PAGES|USERS)(:R)?|(ARTICLE|BASE|FULL|SUB|SUBJECT|TALK)?PAGENAMEE?|REVISIONID|SCRIPTPATH|SERVER(NAME)?|SITENAME)\}\}/\$$1/g;

        # template parameter value with default, leave $!
        $text =~ s/\{\{\{([^\|{}]*)\|([^{}]*)\}\}\}/\$($1\$!$2)/g;

        # template parameter values, leave $parameter
        $text =~ s/\{\{\{([^{}]*)\}\}\}/\$($1)/g;

        # template bang escape, leave $!
        $text =~ s/\{\{!\}\}/\$!/g;

        my $seqno = 1;
        $text =~ s/(ISBN) +([0-9][- 0-9Xx]+[0-9Xx])/&magicword($1,$title,$seqno++,$2)/eg;
        $text =~ s/(RFC) +([0-9]+)/&magicword($1,$title,$seqno++,$2)/eg;
        # multiple passes handle nested template calls
        for ($i = 0; $i < 5; $i++) {
            # pretend that <ref> tags are a kind of template call
            $text =~ s/&lt;ref(\s+name\s*=\s*(&quot;)?([^&\s]*[^&\s\/])(&quot;)?)?\s*(\/|&gt;([^{}]*?)&lt;\/ref\s*)&gt;/&reftag($title,$seqno++,$3,$6)/eg;
            $text =~ s/\{\{\s*([^\|{}]*?)\s*(\|[^{}]*)?\}\}/&template($1,$title,$seqno++,$2)/eg;
        }

        # Debugging
        # print "$title<>$text\n";
    }
}