User:LA2/Extraktor
Extraktor is a Perl script that parses a Wikipedia pages-articles.xml dump, such as can be downloaded from http://download.wikimedia.org/ (the archive most mirror sites will probably want), and outputs an index of all template call parameters. This was first suggested by LA2 in August 2005 on talk:Wikidata as "a minimalist approach", and the script was written one year later.
History
- December 3, 2007: ISBN linking to Wikipedia, announcement by Eric Hellman on the Web4Lib mailing list that OCLC is following LibraryThing's example in linking to ISBNs in Wikipedia
- September 5, 2007: Allow whitespace in ISBNs. For a discussion, see en:Wikipedia talk:ISBN. Here are current statistics from some Wikipedias.
- September 4, 2007: Treat <ref> tags as template calls
- July 21, 2007: An alternative way to extract structured data from Wikipedia dumps is the DBpedia extraction software from the DBpedia project
- February 26, 2007: Wikipedia Citations, with feed, featured by LibraryThing.com
- September 8, 2006: Added support for recognizing the RFC and ISBN magic keyword patterns
- August 30, 2006: Wikitech-l announcement of this script
Usage
For example, the article en:Ophir, Alaska contains this text:
{{otherusesof|Ophir}} '''Ophir''' is an unincorporated area located at {{coor dms|63|08|41|N|156|31|10|W|}}
Copy the source code below and save as extraktor.pl, then apply to the uncompressed XML dump like this:
perl extraktor.pl <enwiki-20060816-pages-articles.xml >enwiki.parameters
For the page above, this script should produce the following output:
Otherusesof|Ophir, Alaska|1|1|Ophir
Coor dms|Ophir, Alaska|2|1|63
Coor dms|Ophir, Alaska|2|2|08
Coor dms|Ophir, Alaska|2|3|41
Coor dms|Ophir, Alaska|2|4|N
Coor dms|Ophir, Alaska|2|5|156
Coor dms|Ophir, Alaska|2|6|31
Coor dms|Ophir, Alaska|2|7|10
Coor dms|Ophir, Alaska|2|8|W
Coor dms|Ophir, Alaska|2|9|
In this simple case, there are only positional parameters without names, and no advanced syntax. The output is line-oriented, with the vertical bar (|) as field separator. Each line has either three or five fields.
- The first field is the name of the called template.
- The second field is the name of the page from where the template was called.
- The third field is a sequence number of this call within the page.
- The fourth field is the name (where applicable) or positional number of a parameter in this call.
- The fifth field is the value of this parameter in this call.
For template calls without parameters, e.g. {{stub}}, fields 4 and 5 are left out.
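The field layout described above can be read back with a few lines of Perl. The following is only a minimal sketch, not part of Extraktor: the helper name parse_line and the hash keys are hypothetical, the first three sample lines are copied from the Ophir example, and the Stub line is an invented parameterless call. It assumes that a split limit of 5 is adequate because the script itself splits arguments on the vertical bar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one line of Extraktor output into named fields.
# 'param' and 'value' stay undef for three-field lines, i.e. template
# calls without parameters.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # A positive limit keeps a trailing empty value instead of dropping it.
    my ($template, $page, $seqno, $param, $value) = split /\|/, $line, 5;
    return {
        template => $template,
        page     => $page,
        seqno    => $seqno,
        param    => $param,
        value    => $value,
    };
}

# Sample lines: three from the Ophir, Alaska example, one invented.
my @sample = (
    "Otherusesof|Ophir, Alaska|1|1|Ophir",
    "Coor dms|Ophir, Alaska|2|1|63",
    "Coor dms|Ophir, Alaska|2|9|",
    "Stub|Ophir, Alaska|3",
);

# Count distinct calls per template; page plus sequence number
# identifies one call.
my %calls;
for my $line (@sample) {
    my $rec = parse_line($line);
    $calls{ $rec->{template} }{ "$rec->{page}\0$rec->{seqno}" } = 1;
}

for my $name (sort keys %calls) {
    printf "%s: %d call(s)\n", $name, scalar keys %{ $calls{$name} };
}
```

To run this against a real index, the loop over @sample could be replaced with a loop over the lines of enwiki.parameters.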
Disclaimer
This simple script doesn't really parse wikimarkup. It only applies some ad-hoc regular expression substitutions. (Note: this is also what the actual MediaWiki "parser" does!) For practical reasons, some difficult syntax such as <math> is simply replaced with the placeholder $math. These placeholders use Perl-like syntax such as $PAGENAME, $(14), [[link $! linktext]] and @(Taxobox:2).
This is a quick hack, not at all "a piece of German engineering".
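To make the placeholder syntax concrete, here is a small sketch that applies four substitutions of the kind described above to one string. The function name placeholders and the sample text are hypothetical, the patterns are simplified from those in the script (only PAGENAME is handled among the magic variables), and the braces are backslash-escaped here as a precaution for newer Perl versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply a few placeholder substitutions to one string.
sub placeholders {
    my ($text) = @_;
    # <math>...</math>, entity-escaped as in the XML dump, becomes $math
    $text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;
    # piped wiki link [[target|label]] becomes [[target$!label]]
    $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
    # magic variable {{PAGENAME}} becomes $PAGENAME
    $text =~ s/\{\{(PAGENAME)\}\}/\$$1/g;
    # template parameter {{{14}}} becomes $(14)
    $text =~ s/\{\{\{([^{}]*)\}\}\}/\$($1)/g;
    return $text;
}

# Invented sample text containing one instance of each construct.
my $sample = 'See [[Energy|energy]]: &lt;math&gt;E=mc^2&lt;/math&gt; '
           . 'on {{PAGENAME}}, arg {{{14}}}';
print placeholders($sample), "\n";
```

After these substitutions no vertical bar remains in the sample, which is the point of the $! placeholder: bars inside links would otherwise be mistaken for parameter separators.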
Source code
#!/usr/bin/perl -w
use utf8;
# use encoding 'utf8';
use locale;
use POSIX;
use strict;
# setlocale(LC_CTYPE, "UTF-8");
# Treat <ref> tags as a kind of template call
sub reftag
{
my ($pagename, $seqno, $name, $value) = @_;
if (defined($name)) {
print "<ref>|$pagename|$seqno|name|$name\n";
}
if (defined($value)) {
# escape any remaining vertical bars
$value =~ s/(http:[^ ]*)\|/$1%7c/g;
$value =~ s/\|/\$!/g;
print "<ref>|$pagename|$seqno|1|$value\n";
}
return "\@(<ref>:$seqno)";
}
# Parse arguments of one template call
sub template
{
my ($tempname, $pagename, $seqno, $args) = @_;
# {{foo bar}} calls Template:Foo_bar with upper case F.
# Might not work for non-ASCII characters.
$tempname = ucfirst($tempname);
$tempname =~ s/ /_/g;
if (defined ($args)) {
my ($field, $value);
my $argc = 0;
$args =~ s/^\|\s*(.*?)\s*$/$1/;
foreach my $arg (split(/\s*\|\s*/, $args)) {
$argc++;
if ($arg =~ /^([^=]*?)\s*=\s*(.*)$/) {
$field = $1;
$value = $2;
} else {
$field = $argc;
$value = $arg;
}
print "$tempname|$pagename|$seqno|$field|$value\n";
}
} else {
# Template call without parameters. Only print three fields
print "$tempname|$pagename|$seqno\n";
}
return "\@($tempname:$seqno)";
}
# This is like sub template() but for magic keywords RFC and ISBN that
# are followed by just one argument.
sub magicword
{
my ($tempname, $pagename, $seqno, $arg) = @_;
print "$tempname|$pagename|$seqno|1|$arg\n";
return "$tempname $arg";
}
# Parse Wikipedia pages-articles XML dump
my ($title, $text, $append) = ("", "", 0);
while (<>) {
if (/\<title\>(.*)\<\/title\>/) {
$title = $1;
$text = "";
next;
}
$append = 1 if /\<text/;
$text .= $_ if $append;
if (/\<\/text\>/) {
my $i;
$append = 0;
$text =~ s/\n/ /g;
$text =~ s/.*\<text[^\>]*\>(.*)\<\/text\>.*/$1/;
# Perform various substitutions to get rid of troublesome
# wiki markup. In its place, leave $something
# silently drop HTML comments
$text =~ s/&lt;!--.*?--&gt;//g;
# ignore nowiki, non-greedy match, leave $nowiki
$text =~ s/&lt;nowiki&gt;.*?&lt;\/nowiki&gt;/\$nowiki/g;
# ignore math, non-greedy match, leave $math
$text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;
# wiki link with alternative text, leave $!
# multiple passes handle image thumbnails
for ($i = 0; $i < 5; $i++) {
$text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
}
# These are not real template calls, leave $pagename
$text =~ s/{{(CURRENT(DAY|DOW|MONTH|TIME(STAMP)?|VERSION|WEEK|YEAR)(ABBREV|NAME(GEN)?)?|(ARTICLE|NAME|SUBJECT|TALK)SPACE|NUMBEROF(ADMINS|ARTICLES|FILES|PAGES|USERS)(:R)?|(ARTICLE|BASE|FULL|SUB|SUBJECT|TALK)?PAGENAMEE?|REVISIONID|SCRIPTPATH|SERVER(NAME)?|SITENAME)}}/\$$1/g;
# template parameter value with default, leave $!
$text =~ s/{{{([^\|{}]*)\|([^{}]*)}}}/\$($1\$!$2)/g;
# template parameter values, leave $parameter
$text =~ s/{{{([^{}]*)}}}/\$($1)/g;
# template bang escape, leave $!
$text =~ s/{{!}}/\$!/g;
my $seqno = 1;
$text =~ s/(ISBN) +([0-9][- 0-9Xx]+[0-9Xx])/&magicword($1,$title,$seqno++,$2)/eg;
$text =~ s/(RFC) +([0-9]+)/&magicword($1,$title,$seqno++,$2)/eg;
# multiple passes handle nested template calls
for ($i = 0; $i < 5; $i++) {
# pretend that <ref> tags are a kind of template call
$text =~ s/&lt;ref(\s+name\s*=\s*(&quot;)?([^&\s]*[^&\s\/])(&quot;)?)?\s*(\/|&gt;([^{}]*?)&lt;\/ref\s*)&gt;/&reftag($title,$seqno++,$3,$6)/eg;
$text =~ s/{{\s*([^\|{}]*?)\s*(\|[^{}]*)?}}/&template($1,$title,$seqno++,$2)/eg;
}
# Debugging
# print "$title<>$text\n";
}
}
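The multi-pass loop near the end of the script is what handles nested template calls. The sketch below isolates that idea under stated assumptions: record_call is a simplified, hypothetical stand-in for template() that only records the call, the input string is invented, and the braces in the pattern are backslash-escaped as a precaution for newer Perl versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @calls;    # each entry: [sequence number, template name, raw arguments]

# Simplified stand-in for the script's template(): record the call and
# return the @(Name:seqno) placeholder that replaces it in the text.
sub record_call {
    my ($name, $seqno, $args) = @_;
    $name = ucfirst $name;
    $name =~ s/ /_/g;
    push @calls, [ $seqno, $name, defined $args ? $args : '' ];
    return "\@($name:$seqno)";
}

# Invented input with one template call nested inside another.
my $text  = '{{infobox|name=Ophir|coord={{coor dms|63|08|41|N}}}}';
my $seqno = 1;

# The pattern only matches calls whose arguments contain no braces, so
# the innermost call is replaced first and the outer one on a later pass.
for my $pass (1 .. 5) {
    $text =~ s/\{\{\s*([^\|{}]*?)\s*(\|[^{}]*)?\}\}/record_call($1, $seqno++, $2)/eg;
}

print "$_->[0] $_->[1] $_->[2]\n" for @calls;
print "$text\n";
```

In this sketch the inner call receives the lower sequence number, and the outer call's coord argument holds the inner call's placeholder rather than the original wikitext. With a fixed count of five passes, calls nested more deeply than five levels would be left unprocessed.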