User:LA2/Extraktor
From Meta, a Wikimedia project coordination wiki
Extraktor is a Perl script that parses a Wikipedia pages-articles.xml dump, such as can be downloaded from http://download.wikimedia.org/ (the archive most mirror sites will probably want), and outputs an index of all template call parameters. This was first suggested by LA2 in August 2005 on talk:Wikidata as "a minimalist approach", and the script was written one year later.
History
- December 3, 2007: ISBN linking to Wikipedia, announcement by Eric Hellman on the Web4Lib mailing list that OCLC is following LibraryThing's example in linking to ISBNs in Wikipedia
- September 5, 2007: Allow whitespace in ISBNs. For a discussion, see en:Wikipedia talk:ISBN. Here are current statistics from some Wikipedias.
- September 4, 2007: Treat <ref> tags as template calls
- July 21, 2007: An alternative way to extract structured data from Wikipedia dumps is the DBpedia extraction software from the DBpedia project
- February 26, 2007: Wikipedia Citations, with feed, featured by LibraryThing.com
- September 8, 2006: Added support for recognizing the RFC and ISBN magic keyword patterns
- August 30, 2006: Wikitech-l announcement of this script
Usage
For example, the article en:Ophir, Alaska contains this text:
{{otherusesof|Ophir}}
'''Ophir''' is an unincorporated area located
at {{coor dms|63|08|41|N|156|31|10|W|}}
Copy the source code below and save as extraktor.pl, then apply to the uncompressed XML dump like this:
perl extraktor.pl <enwiki-20060816-pages-articles.xml >enwiki.parameters
For the page above, this script should produce the following output:
Otherusesof|Ophir, Alaska|1|1|Ophir
Coor dms|Ophir, Alaska|2|1|63
Coor dms|Ophir, Alaska|2|2|08
Coor dms|Ophir, Alaska|2|3|41
Coor dms|Ophir, Alaska|2|4|N
Coor dms|Ophir, Alaska|2|5|156
Coor dms|Ophir, Alaska|2|6|31
Coor dms|Ophir, Alaska|2|7|10
Coor dms|Ophir, Alaska|2|8|W
Coor dms|Ophir, Alaska|2|9|
In this simple case, there are only positional parameters without names, and no advanced syntax. The output is line-oriented, with one record per line and the vertical bar (|) as field separator. Each line has three or five fields.
- The first field is the name of the called template.
- The second field is the name of the page from where the template was called.
- The third field is a sequence number of this call within the page.
- The fourth field is the name (where applicable) or positional number of a parameter in this call.
- The fifth field is the value of this parameter in this call.
For template calls without parameters, e.g. {{stub}}, fields 4 and 5 are left out.
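The field layout described above can be read back with a few lines of Perl. This is only a minimal sketch: parse_record is a hypothetical helper that is not part of extraktor.pl, and the Stub line in the sample data is made up to show the three-field case.

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper, not part of extraktor.pl.
# It splits one output line into a hash; records without parameters
# (three fields) get undef for "param" and "value".
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # A limit of 5 keeps everything after the fourth bar in the value.
    my ($template, $page, $seqno, $param, $value) = split /\|/, $line, 5;
    return {
        template => $template,
        page     => $page,
        seqno    => $seqno,
        param    => $param,
        value    => $value,
    };
}

# Two lines from the Ophir, Alaska example, plus a made-up bare call.
my @sample = (
    "Otherusesof|Ophir, Alaska|1|1|Ophir",
    "Coor dms|Ophir, Alaska|2|4|N",
    "Stub|Ophir, Alaska|3",
);

for my $line (@sample) {
    my $r = parse_record($line);
    if (defined $r->{param}) {
        print "$r->{page}: $r->{template} call $r->{seqno}, $r->{param} = $r->{value}\n";
    } else {
        print "$r->{page}: $r->{template} call $r->{seqno}, no parameters\n";
    }
}
```

Testing whether the fourth field is defined is enough to tell the two record shapes apart, so no separate field count is needed.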
Disclaimer
This simple script doesn't really parse wikimarkup. It only applies some ad-hoc regular expression substitutions. (Note: this is also what the actual MediaWiki "parser" does!) For practical reasons, some difficult syntax such as <math> is simply replaced with the placeholder $math. These placeholders use Perl-like syntax such as $PAGENAME, $(14), [[link $! linktext]] and @(Taxobox:2).
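The @(Taxobox:2) style of placeholder is what lets a plain regular expression cope with nested template calls. The sketch below illustrates the idea under simplifying assumptions: collapse_templates is a hypothetical stand-in for the template pass of the script, it handles only {{...}} calls, it returns the records instead of printing them, and the input string is made up.

```perl
use strict;
use warnings;

# collapse_templates is a hypothetical, simplified stand-in for the
# template pass of the script. It handles only {{...}} calls and
# returns the records instead of printing them.
sub collapse_templates {
    my ($text) = @_;
    my @records;
    my $seqno = 1;

    # Record one call and return the placeholder that replaces it.
    my $note = sub {
        my ($name, $args) = @_;
        $name = ucfirst $name;
        $name =~ s/ /_/g;
        my $id = $name . ":" . $seqno++;
        push @records, $id . (defined $args ? $args : "");
        return "\@($id)";
    };

    # The pattern forbids braces inside a call, so each pass can only
    # match the innermost calls still present in the text.
    for my $pass (1 .. 5) {
        $text =~ s/\{\{\s*([^|{}]*?)\s*(\|[^{}]*)?\}\}/$note->($1, $2)/eg;
    }
    return ($text, @records);
}

# Made-up input with one call nested inside another.
my ($rest, @calls) =
    collapse_templates('{{Infobox|coords={{coor dms|63|08|N}}|name=Ophir}}');
print "$rest\n";
print "$_\n" for @calls;
```

On the first pass only the inner coor dms call matches. Once it has been replaced by a placeholder without braces, the outer call matches on the second pass and carries the placeholder as one of its parameter values. A fixed number of passes means that calls nested more than five levels deep are left unparsed.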
This is a quick hack, not at all "a piece of German engineering".
Source code
#!/usr/bin/perl -w
use utf8;
# use encoding 'utf8';
use locale;
use POSIX;
use strict;
# setlocale(LC_CTYPE, "UTF-8");

# Treat <ref> tags as a kind of template call
sub reftag () {
    my ($pagename, $seqno, $name, $value) = @_;
    if (defined($name)) {
        print "<ref>|$pagename|$seqno|name|$name\n";
    }
    if (defined($value)) {
        # escape any remaining vertical bars
        $value =~ s/(http:[^ ]*)\|/$1%7c/g;
        $value =~ s/\|/\$!/g;
        print "<ref>|$pagename|$seqno|1|$value\n";
    }
    return "\@(<ref>:$seqno)";
}

# Parse arguments of one template call
sub template () {
    my ($tempname, $pagename, $seqno, $args) = @_;
    # {{foo bar}} calls Template:Foo_bar with upper case F.
    # Might not work for non-ASCII characters.
    $tempname = ucfirst($tempname);
    $tempname =~ s/ /_/g;
    if (defined ($args)) {
        my ($field, $value);
        my $argc = 0;
        $args =~ s/^\|\s*(.*?)\s*$/$1/;
        foreach my $arg (split(/\s*\|\s*/, $args)) {
            $argc++;
            if ($arg =~ /^([^=]*?)\s*=\s*(.*)$/) {
                $field = $1;
                $value = $2;
            } else {
                $field = $argc;
                $value = $arg;
            }
            print "$tempname|$pagename|$seqno|$field|$value\n";
        }
    } else {
        # Template call without parameters.  Only print three fields
        print "$tempname|$pagename|$seqno\n";
    }
    return "\@($tempname:$seqno)";
}

# This is like sub template() but for magic keywords RFC and ISBN that
# are followed by just one argument.
sub magicword () {
    my ($tempname, $pagename, $seqno, $arg) = @_;
    print "$tempname|$pagename|$seqno|1|$arg\n";
    return "$tempname $arg";
}

# Parse Wikipedia pages-articles XML dump
my ($title, $text, $append) = ("", "", 0);
while (<>) {
    if (/\<title\>(.*)\<\/title\>/) {
        $title = $1;
        $text = "";
        next;
    }
    $append = 1 if /\<text/;
    $text .= $_ if $append;
    if (/\<\/text\>/) {
        my $i;
        $append = 0;
        $text =~ s/\n/ /g;
        $text =~ s/.*\<text[^\>]*\>(.*)\<\/text\>.*/$1/;

        # Perform various substitutions to get rid of troublesome
        # wiki markup.  In its place, leave $something

        # Literal left braces are written as \{ in the patterns below,
        # because newer Perl versions reject an unescaped { there.

        # silently drop HTML comments
        $text =~ s/&lt;!--.*?--&gt;//g;

        # ignore nowiki, non-greedy match, leave $nowiki
        $text =~ s/&lt;nowiki&gt;.*?&lt;\/nowiki&gt;/\$nowiki/g;

        # ignore math, non-greedy match, leave $math
        $text =~ s/&lt;math&gt;.*?&lt;\/math&gt;/\$math/g;

        # wiki link with alternative text, leave $!
        # multiple passes handle image thumbnails
        for ($i = 0; $i < 5; $i++) {
            $text =~ s/(\[\[[^\]\|{}]*)\|([^\]{}]*\]\])/$1\$!$2/g;
        }

        # These are not real template calls, leave $pagename
        $text =~ s/\{\{(CURRENT(DAY|DOW|MONTH|TIME(STAMP)?|VERSION|WEEK|YEAR)(ABBREV|NAME(GEN)?)?|(ARTICLE|NAME|SUBJECT|TALK)SPACE|NUMBEROF(ADMINS|ARTICLES|FILES|PAGES|USERS)(:R)?|(ARTICLE|BASE|FULL|SUB|SUBJECT|TALK)?PAGENAMEE?|REVISIONID|SCRIPTPATH|SERVER(NAME)?|SITENAME)}}/\$$1/g;

        # template parameter value with default, leave $!
        $text =~ s/\{\{\{([^\|{}]*)\|([^{}]*)}}}/\$($1\$!$2)/g;

        # template parameter values, leave $parameter
        $text =~ s/\{\{\{([^{}]*)}}}/\$($1)/g;

        # template bang escape, leave $!
        $text =~ s/\{\{!}}/\$!/g;

        my $seqno = 1;
        # magic keywords ISBN and RFC; ISBNs may contain hyphens and spaces
        $text =~ s/(ISBN) +([0-9][- 0-9Xx]+[0-9Xx])/&magicword($1,$title,$seqno++,$2)/eg;
        $text =~ s/(RFC) +([0-9]+)/&magicword($1,$title,$seqno++,$2)/eg;

        # multiple passes handle nested template calls
        for ($i = 0; $i < 5; $i++) {
            # pretend that <ref> tags are a kind of template call
            $text =~ s/&lt;ref(\s+name\s*=\s*(&quot;)?([^&\s]*[^&\s\/])(&quot;)?)?\s*(\/|&gt;([^{}]*?)&lt;\/ref\s*)&gt;/&reftag($title,$seqno++,$3,$6)/eg;
            $text =~ s/\{\{\s*([^\|{}]*?)\s*(\|[^{}]*)?}}/&template($1,$title,$seqno++,$2)/eg;
        }

        # Debugging
        # print "$title<>$text\n";
    }
}