User:LA2/Extraktor

Extraktor is a Perl script that parses a Wikipedia pages-articles.xml dump, which can be downloaded from http://download.wikimedia.org/ (the archive most mirror sites will probably want), and outputs an index of all template call parameters. LA2 first suggested this in August 2005 on talk:Wikidata as "a minimalist approach", and the script was written one year later.

History

December 3, 2007: ISBN linking to Wikipedia, announcement by Eric Hellman on the Web4Lib mailing list that OCLC is following LibraryThing's example in linking to ISBNs in Wikipedia
September 5, 2007: Allow whitespace in ISBNs. For a discussion, see en:Wikipedia talk:ISBN. Here are current statistics from some Wikipedias.
September 4, 2007: Treat <ref> tags as template calls
July 21, 2007: An alternative way to extract structured data from Wikipedia dumps is the DBpedia extraction software from the DBpedia project
February 26, 2007: Wikipedia Citations, with feed, featured by LibraryThing.com
September 8, 2006: Added support for recognizing the RFC and ISBN magic keyword patterns
August 30, 2006: Wikitech-l announcement of this script

Usage

For example, the article en:Ophir, Alaska contains this text:

{{otherusesof|Ophir}}
'''Ophir''' is an unincorporated area located
at {{coor dms|63|08|41|N|156|31|10|W|}}

Copy the source code below and save it as extraktor.pl, then apply it to the uncompressed XML dump like this:

perl extraktor.pl <enwiki-20060816-pages-articles.xml >enwiki.parameters

For the page above, this script should produce the following output:

Otherusesof|Ophir, Alaska|1|1|Ophir
Coor dms|Ophir, Alaska|2|1|63
Coor dms|Ophir, Alaska|2|2|08
Coor dms|Ophir, Alaska|2|3|41
Coor dms|Ophir, Alaska|2|4|N
Coor dms|Ophir, Alaska|2|5|156
Coor dms|Ophir, Alaska|2|6|31
Coor dms|Ophir, Alaska|2|7|10
Coor dms|Ophir, Alaska|2|8|W
Coor dms|Ophir, Alaska|2|9|

In this simple case, there are only positional parameters without names, and no advanced syntax. The output is line-oriented with the vertical bar (|) as field separator. Each line has either three or five fields.

The first field is the name of the called template.
The second field is the name of the page from where the template was called.
The third field is a sequence number of this call within the page.
The fourth field is the name (where applicable) or positional number of a parameter in this call.
The fifth field is the value of this parameter in this call.

For template calls without parameters, e.g. {{stub}}, fields 4 and 5 are left out.
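
As an illustration of the field layout described above, a program reading the index might split each line as follows. This is a minimal sketch and not part of Extraktor: the helper name parse_line is hypothetical, and it assumes that template names and page titles never contain a vertical bar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one line of Extraktor output into named fields.
# The limit of 5 keeps a trailing empty value and leaves any stray
# vertical bar inside the value field.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($template, $page, $seqno, $param, $value) = split /\|/, $line, 5;
    return {
        template => $template,
        page     => $page,
        seqno    => $seqno,
        param    => $param,    # undef for calls without parameters
        value    => $value,    # undef for calls without parameters
    };
}

my $rec = parse_line("Coor dms|Ophir, Alaska|2|5|156\n");
print "$rec->{template} call $rec->{seqno} on $rec->{page}: ",
      "$rec->{param} = $rec->{value}\n";

my $bare = parse_line("Stub|Ophir, Alaska|3\n");
print "$bare->{template} has no parameters\n" unless defined $bare->{param};
```

A three-field line leaves param and value undefined, so a reader can tell a call without parameters apart from a parameter with an empty value.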

Disclaimer

This simple script doesn't really parse wikimarkup. It only applies some ad-hoc regular expression substitutions. (Note: this is also what the actual MediaWiki "parser" does!) For practical reasons, some difficult syntax such as <math> is simply replaced with the placeholder $math. These placeholders use Perl-like syntax such as $PAGENAME, $(14), [[link $! linktext]] and @(Taxobox:2).
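
The placeholder substitution described above can be sketched on one line of sample text. This is an illustration only: it assumes, as in the XML dump, that angle brackets in wikitext arrive escaped as &lt; and &gt;, and the function name placeholders and the sample sentence are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply two of the ad-hoc substitutions to a string.
sub placeholders {
    my ($text) = @_;

    # Replace math markup with the placeholder $math (non-greedy match).
    $text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;

    # Replace the vertical bar of a piped wiki link with the placeholder $!
    # so it is not mistaken for a template parameter separator later on.
    $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;

    return $text;
}

my $sample = 'Area is &lt;math&gt;\pi r^2&lt;/math&gt; in [[circle|circles]].';
print placeholders($sample), "\n";
# prints: Area is $math in [[circle$!circles]].
```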

This is a quick hack, not at all "a piece of German engineering".

Source code

#!/usr/bin/perl -w 

use utf8;
# use encoding 'utf8';
use locale;
use POSIX;
use strict;
# setlocale(LC_CTYPE, "UTF-8");

# Treat <ref> tags as a kind of template call
sub reftag
{
    my ($pagename, $seqno, $name, $value) = @_;
    if (defined($name)) {
        print "<ref>|$pagename|$seqno|name|$name\n";
    }
    if (defined($value)) {
        # escape any remaining vertical bars
        $value =~ s/(http:[^ ]*)\|/$1%7c/g;
        $value =~ s/\|/\$!/g;
        print "<ref>|$pagename|$seqno|1|$value\n";
    }
    return "\@(<ref>:$seqno)";
}

# Parse arguments of one template call
sub template
{
    my ($tempname, $pagename, $seqno, $args) = @_;

    # {{foo bar}} calls Template:Foo_bar with upper case F.
    # Might not work for non-ASCII characters.
    $tempname = ucfirst($tempname);
    $tempname =~ s/ /_/g;

    if  (defined ($args)) {
        my ($field, $value);
        my $argc = 0;
        $args =~ s/^\|\s*(.*?)\s*$/$1/;
        foreach my $arg (split(/\s*\|\s*/, $args)) {
            $argc++;
            if ($arg =~ /^([^=]*?)\s*=\s*(.*)$/) {
                $field = $1;
                $value = $2;
            } else {
                $field = $argc;
                $value = $arg;
            }
            print "$tempname|$pagename|$seqno|$field|$value\n";
        }
    } else {
        # Template call without parameters. Only print three fields
        print "$tempname|$pagename|$seqno\n";
    }
    return "\@($tempname:$seqno)";
}

# This is like sub template() but for magic keywords RFC and ISBN that
# are followed by just one argument.
sub magicword
{
    my ($tempname, $pagename, $seqno, $arg) = @_;
    print "$tempname|$pagename|$seqno|1|$arg\n";
    return "$tempname $arg";
}

# Parse Wikipedia pages-articles XML dump
my ($title, $text, $append) = ("", "", 0);
while (<>) {
    if (/\<title\>(.*)\<\/title\>/) {
        $title = $1;
        $text = "";
        next;
    }
    $append = 1 if /\<text/;
    $text .= $_ if $append;
    if (/\<\/text\>/) {
        my $i;
        $append = 0;
        $text =~ s/\n/ /g;

        $text =~ s/.*\<text[^\>]*\>(.*)\<\/text\>.*/$1/;

        # Perform various substitutions to get rid of troublesome
        # wiki markup.  In its place, leave $something

        # silently drop HTML comments
        $text =~ s/&lt;!--.*?--&gt;//g;

        # ignore nowiki, non-greedy match, leave $nowiki
        $text =~ s/&lt;nowiki&gt;.*?&lt;\/nowiki&gt;/\$nowiki/g;

        # ignore math, non-greedy match, leave $math
        $text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;

        # wiki link with alternative text, leave $!
        # multiple passes handle image thumbnails
        for ($i = 0; $i < 5; $i++) {
            $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
        }

        # These are not real template calls, leave $pagename
        $text =~ s/{{(CURRENT(DAY|DOW|MONTH|TIME(STAMP)?|VERSION|WEEK|YEAR)(ABBREV|NAME(GEN)?)?|(ARTICLE|NAME|SUBJECT|TALK)SPACE|NUMBEROF(ADMINS|ARTICLES|FILES|PAGES|USERS)(:R)?|(ARTICLE|BASE|FULL|SUB|SUBJECT|TALK)?PAGENAMEE?|REVISIONID|SCRIPTPATH|SERVER(NAME)?|SITENAME)}}/\$$1/g;

        # template parameter value with default, leave $!
        $text =~ s/{{{([^\|{}]*)\|([^{}]*)}}}/\$($1\$!$2)/g;

        # template parameter values, leave $parameter
        $text =~ s/{{{([^{}]*)}}}/\$($1)/g;

        # template bang escape, leave $!
        $text =~ s/{{!}}/\$!/g;

        my $seqno = 1;
        $text =~ s/(ISBN) +([0-9][- 0-9Xx]+[0-9Xx])/&magicword($1,$title,$seqno++,$2)/eg;
        $text =~ s/(RFC) +([0-9]+)/&magicword($1,$title,$seqno++,$2)/eg;
        # multiple passes handle nested template calls
        for ($i = 0; $i < 5; $i++) {
            # pretend that <ref> tags are a kind of template call
            $text =~ s/&lt;ref(\s+name\s*=\s*(&quot;)?([^&\s]*[^&\s\/])(&quot;)?)?\s*(\/|&gt;([^{}]*?)&lt;\/ref\s*)&gt;/&reftag($title,$seqno++,$3,$6)/eg;
            $text =~ s/{{\s*([^\|{}]*?)\s*(\|[^{}]*)?}}/&template($1,$title,$seqno++,$2)/eg;
        }

        # Debugging
        # print "$title<>$text\n";
    }
}
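
The multi-pass loop near the end of the script resolves nested template calls from the inside out. The following stand-alone sketch shows that idea in isolation. It is not the script itself: the helper names mark and resolve_nested are hypothetical, the sample call is invented, and the calls are collected in a list instead of being printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: normalise a template name, record the call in the
# list referenced by $found, and return the marker that replaces the call.
sub mark {
    my ($found, $name, $seqno) = @_;
    $name = ucfirst $name;
    $name =~ s/ /_/g;
    push @$found, "$name:$seqno";
    return "\@($name:$seqno)";
}

# Resolve template calls innermost-first. The pattern excludes braces
# inside a call, so an outer call can only match after every call nested
# in it has been replaced by a marker on an earlier pass.
sub resolve_nested {
    my ($text) = @_;
    my ($seqno, @found) = (1);
    for my $pass (1 .. 5) {
        $text =~ s/\{\{\s*([^\|{}]*?)\s*(\|[^{}]*)?\}\}/mark(\@found, $1, $seqno++)/eg;
    }
    return ($text, @found);
}

my ($rest, @calls) = resolve_nested('{{infobox|coord={{coor dms|63|08}}}}');
print "$rest\n";     # prints: @(Infobox:2)
print "@calls\n";    # prints: Coor_dms:1 Infobox:2
```

With five passes, calls nested more than five levels deep are left unresolved in the text, which is the same limit the fixed loop count implies in the script.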