Talk:Data dumps/mwimport


siteinfo: untested generator 'MediaWiki 1.17wmf1'

Running

  cat enwiki-20110901-pages-articles.xml | ~/firstdump/mwimport | mysql -f -u admin -p pw dbname

produces a long mysql man page, along with the message

  siteinfo: untested generator 'MediaWiki 1.17wmf1', expect trouble ahead

Help! Does this mean there is something wrong with the mysql command rather than with the mwimport script?


This is not a problem: it just tells you that the dump was created with a newer version of the MediaWiki software than the script was tested with. Probably it will work - just test it. FlySoft (talk) 19:49, 10 June 2012 (UTC)
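
For illustration only (this is not the actual mwimport code, just a sketch of the kind of check behind that message): the generator string from the dump's <siteinfo> is compared against a list of versions the script was tested with, a warning is printed if it is not found, and parsing then simply continues. Both the list and the generator string below are made up for the sketch:

  use strict;
  use warnings;

  # Hypothetical list of generators treated as "tested" in this sketch.
  my @tested    = ('MediaWiki 1.8alpha', 'MediaWiki 1.9alpha');
  my $generator = 'MediaWiki 1.17wmf1';   # as read from the dump's <siteinfo>

  unless (grep { $_ eq $generator } @tested) {
      warn "siteinfo: untested generator '$generator', expect trouble ahead\n";
  }

In this sketch the warning goes to stderr and nothing is aborted, which is why imports with an "untested" generator usually still go through.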

compilation errors in mwimport

I know you're not the maintainer of this script, but perhaps someone can figure these compilation errors out:

  • syntax error at ./mwimport.pl line 60, near "= ;"
  • syntax error at ./mwimport.pl line 72, near "= ;"
  • syntax error at ./mwimport.pl line 184, near "= ;"
  • syntax error at ./mwimport.pl line 299, near "= ;"

I am running this on Ubuntu Linux, and can't, for the life of me, find out what "= ;" should do.


I have an identical issue on Kubuntu --12.5.52.170 08:13, 16 January 2007 (UTC)

Yes, MediaWiki was munging my code; I've now switched to using <pre> so that cut and paste should work again. Sorry for the hassle, Robbe 14:08, 23 January 2007 (UTC)

How do I "pipe it into SQL"?

Can anybody explain a short way to use the script to actually import the dump from XML into the SQL database (MySQL 5)? And are there any settings I have to adjust within the script?

I think this could be very helpful to a lot of people, as importDump.php is really slow... :-(

Thank You!


Hi! I assume you're using Linux; there is an example in the article, but I'll explain how it worked for me:

  • bzcat dewiki-20120603-pages-articles.xml.bz2 | perl mwimport.pl | mysql -f -u [USERNAME] -p [DATABASE]

(Assuming you are using the German wiki and have Perl installed.)

Simply adjust dewiki-[...].xml.bz2 to the correct file, and replace [USERNAME] etc. with the correct values for your database. Then press Enter; you'll be asked for your MySQL password, so enter it. Then you'll see the import running.


What does this command do? Actually, it is very easy: it extracts the content of the xml.bz2 file to standard output (stdout). The script (mwimport) then takes over the data: it converts the XML to SQL and writes the SQL to stdout. After that, mysql reads that stream and loads it into your database.

And all three commands run at the same time ;)

I hope I helped you -- FlySoft (talk) 19:57, 10 June 2012 (UTC)
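
To make the middle stage of that pipeline a little more concrete, here is a toy stand-in for mwimport (not the real script, and far too naive for a real import): like mwimport, it reads the dump XML from stdin and writes SQL statements to stdout, which is what lets the three commands run as a single pipeline.

  #!/usr/bin/perl
  # Toy illustration only -- NOT the real mwimport.
  use strict;
  use warnings;

  while (my $line = <STDIN>) {
      # Pull page titles out of the export XML; real parsing is far more involved.
      if ($line =~ m{<title>(.*?)</title>}) {
          my $title = $1;
          $title =~ s/'/''/g;    # naive SQL quoting
          print "INSERT INTO page (page_title) VALUES ('$title');\n";
      }
  }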

Successful execution of mwimport on MS W2k!

This command ran successfully to import itwiki, frwiki and dewiki into a fresh mediawiki-1.10.1 on a Windows 2000 machine (ActivePerl 5.8.8 Build 820):

  type enwiki-<date>.xml | perl mwimport | mysql -f -u <admin name> -p <database name>

Note: the type command was used instead of cat, and perl was added before mwimport.


Output displayed by mwimport at the start of the run:

  siteinfo: untested generator 'Mediawiki 1.11alpha', expect trouble ahead

But there was no "trouble ahead"! Output displayed by mwimport at the end of the run (dewiki):

  1248932 pages (122.026/s),1248932 revisions (122.026/s) in 19235 seconds

Big thanks to the author of mwimport !


Trouble importing enwiki

Trying to import enwiki using the W2k modification above, mwimport repeatedly displays the following error message:

  page revision: eof at line 52096161 (committed 1038413 pages)

Execution of mwimport stops after 1038413 pages.

As a newbie in Perl, I would kindly ask contributors to provide some error handling for this subroutine:

sub getline()
{
  $_ = <>;                               # read the next line of the dump from stdin
  defined $_ or die "eof at line $.\n";  # abort if the input ends unexpectedly
}


I need help with the same thing too; if anyone knows a fix, please share it. I tried everything with no luck. Execution of mwimport stops after 1038413 pages.
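
A note on the error handling asked for above (a sketch only, not a tested patch): the die in getline fires when the input stream ends before the dump is complete, which typically points to a truncated download or an interrupted decompression rather than a bug in the script. A gentler variant could report what happened and return false, so the caller can commit what it has and stop cleanly; the callers would then need to check the return value instead of assuming getline always succeeds.

  sub getline()
  {
    $_ = <>;
    unless (defined $_) {
      warn "input ended unexpectedly at line $.; "
         . "the dump may be truncated or the decompression may have failed\n";
      return 0;   # signal end-of-input to the caller instead of dying
    }
    return 1;
  }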

updates

Hey people,

I made two changes in order to make it work with the current version of MediaWiki (1.11.0) and the current xml.bz2 dump (2007-10-10):

-> mwimport.pl:


  - ."$page{redirect},0,RAND(),NOW(),$page{latest},$page{latest_len}),\n";
  + ."$page{redirect},0,RAND(),NOW()+0,$page{latest},$page{latest_len}),\n";


This fixed a bug that caused a wrongly formatted timestamp to be inserted for the current style of MySQL table: MediaWiki's timestamp columns expect the 14-digit YYYYMMDDHHMMSS form, which NOW()+0 produces, whereas plain NOW() yields 'YYYY-MM-DD HH:MM:SS'.


  -> cat enwiki-<date>.xml | mwimport | mysql -f -u <admin name> -p <database name>
  -> cat enwiki-<date>.xml | mwimport | mysql -f -u<admin name> -p<admin password> --default-character-set=utf8 <database name>



On a standard installation (Debian Etch) MySQL interprets the incoming data stream as latin1; with the added --default-character-set=utf8 option you can set the charset to utf8 (which matches the encoding of the dumps).

By the way: you can pipe the dump through bzip2 to decompress the XML on the fly.


Just change the revision if you feel uncomfortable with the changes ;)


Problem with above commands

Hello,

I have followed the above procedure to import the XML dump, but I am stuck because of the following error:

-bash: mwimport: command not found.

I have stored the XML dump and the mwimport.pl file in the public_ftp folder.

My MediaWiki installation is on a Linux host.


Can anyone please suggest a solution to my problem?


Answer: It looks like mwimport.pl is not in your PATH, and the current directory ('.') isn't either. Try preceding the script name with './' (and make sure the file is executable), like so:

./mwimport.pl

Alternatively, invoke it through the interpreter, as suggested elsewhere on this page: perl mwimport.pl


Error:

Hi! I have downloaded the latest file from the site (enwiki-latest-pages-articles.xml). I uncompressed it, and when I process it with mwimport I get the following error:

 1421000 pages (1234.579/s),   1421000 revisions (1234.579/s) in 1151 seconds
 1422000 pages (1235.447/s),   1422000 revisions (1235.447/s) in 1151 seconds
 1423000 pages (1235.243/s),   1423000 revisions (1235.243/s) in 1152 seconds
 1424000 pages (1236.111/s),   1424000 revisions (1236.111/s) in 1152 seconds
 1425000 pages (1235.906/s),   1425000 revisions (1235.906/s) in 1153 seconds

page: revision: expected contributor element in line 85481733

(committed 1424471 pages)

Please tell me the way to get around it.


unknown schema or invalid first line

With this first line of the file (enwiki-20100130-pages-articles.xml):

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.4/ http://www.mediawiki.org/xml/export-0.4.xsd" version="0.4" xml:lang="en">

mwimport.sh gives: "unknown schema or invalid first line"

Looking at the line that generates the error shows it is a version issue (0.3 vs 0.4):

m|^<mediawiki \Qxmlns="$SchemaLoc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="$SchemaLoc $Schema" version="$SchemaVer"\E xml:lang="..">$| or die "unknown schema or invalid first line\n";

Any suggestions? Thanks



Hi! I think you need to change line 339 to: my $SchemaVer = '0.4';

It works for me, but I have the next problem with the enwiki-20100312-pages-logging.xml file:

mediawiki: expected closing tag in line 32

   27	      <namespace key="101">Portal talk</namespace>
   28	      <namespace key="108">Book</namespace>
   29	      <namespace key="109">Book talk</namespace>
   30	    </namespaces>
   31	  </siteinfo>
   32	    <logitem>
   33	      <id>1</id>
   34	      <timestamp>2004-12-23T03:20:32Z</timestamp>
   35	      <contributor>

I don't know what is happening. Any ideas?

Thank you very much!



You should use the old 0.4 version of this script: http://meta.wikimedia.org/w/index.php?title=Data_dumps/mwimport&oldid=2111539
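
For what it's worth, here is a sketch (not the maintained script) of how the hard-coded check could be made less brittle: read the schema version out of the dump's first line and warn, rather than die, when it is not one the script was written for. Note that this does not help with the pages-logging dump above, which contains <logitem> entries rather than the <page> entries the script appears to expect.

  use strict;
  use warnings;

  # Sketch only: derive the schema version from the first line instead of
  # hard-coding $SchemaVer, and warn rather than die on an untested version.
  my $first = <>;
  defined $first or die "empty input\n";
  if ($first =~ m|^<mediawiki xmlns="http://www\.mediawiki\.org/xml/export-([0-9.]+)/"|) {
      my $ver = $1;
      warn "untested schema version $ver, expect trouble ahead\n"
          unless $ver eq '0.3' or $ver eq '0.4';
  } else {
      die "unknown schema or invalid first line\n";
  }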

Where are all the models?

I finally got it to import the latest and greatest dumps by adding

 simple_elt title => \%page;
 simple_elt ns => \%page;
 simple_elt id => \%page;
 simple_opt_elt redirect => \%page;
 simple_elt sha1 => \%page;

and changing the simple_opt_elt to:

 sub simple_opt_elt($$)
 {
   if (m|^\s*<$_[0]\s*.*/>\n$|) {
     # self-closing element, e.g. <redirect /> -- the value here was lost to
     # the wiki munging mentioned above; marking the element as present (1)
     # is an assumption
     $_[1]{$_[0]} = 1;
   } elsif (m|^\s*<$_[0]>(.*?)</$_[0]>\n$|) {
     $_[1]{$_[0]} = $1;
   } else {
     return;
   }
   getline;
 }

But even if I import the pages-meta-current, it doesn't show the models. I get the pages with the text, but whenever there is a model, it fails. Any ideas?

Can't get it to work

I just tried using mwimport and got the following error:

unknown schema or invalid first line

So I changed

my $SchemaVer = '0.5';

to

my $SchemaVer = '0.6';

Now I get:

siteinfo: untested generator 'MediaWiki 1.19wmf1', expect trouble ahead
page: expected id element in line 32
 (committed 0 pages)

Any help here? Totally new at this. Freaky Fries (talk) 09:18, 5 April 2012 (UTC)

I just updated the script to allow for schema 0.6. Does it work now? -- MarkLodato (talk) 04:40, 29 April 2012 (UTC)
With this new version I get the following error: -- (peter [at] rlsux [dot] com) 08:37, 30 May 2012
Extracting dewiki-20120413-pages-meta-history.xml
Use of qw(...) as parentheses is deprecated at ./mwimport-0.6.pl line 280.
Use of qw(...) as parentheses is deprecated at ./mwimport-0.6.pl line 332.
page: expected closing tag in line 34
(committed 0 pages)
Line 34 in the dump is: <sha1 />
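
Two notes on the errors above, with an illustrative sketch rather than a patch against mwimport-0.6.pl (the field names and sample lines below are made up). The qw(...) warning typically comes from a foreach that uses qw() directly as its list, which newer Perls require to be wrapped in parentheses. The failure on <sha1 /> suggests the sha1 element is parsed as a mandatory <sha1>...</sha1> pair, so the self-closing form in the history dump trips it up; an optional-element handler in the spirit of simple_opt_elt above would accept both forms.

  #!/usr/bin/perl
  # Illustrative sketch only, not a patch to mwimport-0.6.pl.
  use strict;
  use warnings;

  # 1. qw() used as a foreach list now needs explicit parentheses:
  for my $field (qw(id title sha1)) {
      print "handling <$field>\n";
  }

  # 2. Accept both <sha1 /> and <sha1>...</sha1>:
  my %rev;
  for my $line ("    <sha1 />\n", "    <sha1>abc123def</sha1>\n") {
      if ($line =~ m{^\s*<sha1\s*/>\s*$}) {
          $rev{sha1} = '';                  # element present but empty
      } elsif ($line =~ m{^\s*<sha1>(.*?)</sha1>\s*$}) {
          $rev{sha1} = $1;
      }
      printf "sha1 = '%s'\n", $rev{sha1};
  }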

Add perl before mwimport

Hi there, I needed to add perl in front of mwimport. It should be safe to add this to the main page, I think. That command runs for everyone ;)

If I'm wrong, correct me ;)

-- FlySoft (talk) 19:59, 10 June 2012 (UTC)

mysql?

  type enwiki-<date>.xml | perl mwimport | mysql -f -u <admin name> -p <database name>

What does mysql mean in the above command? Is that where you have to enter the database IP address? MahdiTheGuidedOne (talk) 06:56, 12 July 2012 (UTC)

updated for 0.9

Updated script to work with schema version 0.9, added dbname to parser. Olytechy (talk) 19:56, 19 November 2014 (UTC)