Talk:Data dumps/xml2sql

From Meta, a Wikimedia project coordination wiki
Latest comment: 9 months ago by AKA MBG in topic Download

redirect tag


Xml2sql is a great tool, but it needs a small update to cope with a simple change in the dump format. I have not been able to do it myself so far, but someone with the required knowledge could fix it.

The XML schema changed and the current xml2sql programme doesn't work. If you run it using a recent dump (e.g. from October 2009), you'll get this error:

$ bzcat enwiki-latest-pages-articles.xml.bz2 | ./xml2sql
unexpected element <redirect>
./xml2sql: parsing aborted at line 33 pos 16.

The problem is the "<redirect />" element in the XML file. xml2sql doesn't know what to do with it and so stops. Each article has a "<redirect>" tag, and it doesn't change for any of the articles.

This is taken from here. http://blog.technomancy.org/2009/10/21/how-to-import-current-wikipedia-dumps

Can anybody look into this and make the change? It should only be a matter of adding a simple case, since redirect has no attributes. Can anybody help?

Also, will everything work if we simply strip "<redirect />" off?

Thanks for any comment or help.

Hi! Yes, I think we can delete every line in the XML dump that contains the text "<redirect />". For example, use my simple Perl script xml2sql_helper.pl, invoked as follows:
perl xml2sql_helper.pl in_file out_file
After that, xml2sql produces the SQL files without error messages. Good luck! -- AKA MBG 10:07, 30 December 2009 (UTC)
If you want to strip off the redirect tags without adding another step, you can change the bzcat line to:
bzcat enwiki-latest-pages-articles.xml.bz2 | grep -v "<redirect />" | ./xml2sql

/Fluff 23:27, 11 January 2010 (UTC)

It is a pity, but I realized that stripping this line as you suggest does not leave redirects working as expected.
In my case, redirects no longer redirect automatically as they did before this change was needed.
Now, for a redirect, you are presented with a page saying: Redirection to "Other article", but the redirect is not followed automatically.
Could somebody see how to fix xml2sql so that it properly carries this tag into the database?
Many thanks again!
It seems redirects are working again after I imported the dump into 1.14 and ran update.php to upgrade to 1.15.1 ...
I am not sure whether this means that importing directly into 1.15.1 will work straight away. -- 07:46, 27 January 2010 (UTC)
A bit messy, but after importing with either the Perl script above or with grep -v you could run an UPDATE with the WHERE condition: old_text LIKE '#REDIRECT%' /Fluff 22:01, 26 April 2010 (UTC)
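The UPDATE suggested above could look roughly like the sketch below. It is only an illustration, run against an in-memory SQLite database so that it is self-contained. The table and column names (page.page_latest, page.page_is_redirect, revision.rev_id, revision.rev_text_id, text.old_id, text.old_text) are assumed to follow the MediaWiki schema of that era and should be checked against your own installation. The helper names build_demo_db and flag_redirects are made up for this example. Note that a LIKE pattern is already anchored at the start of the value, so no '^' is needed.

```python
import sqlite3

# Assumed, heavily simplified MediaWiki-style tables; check the real schema.
SCHEMA = """
CREATE TABLE page (page_id INTEGER, page_title TEXT,
                   page_latest INTEGER, page_is_redirect INTEGER DEFAULT 0);
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_text_id INTEGER);
CREATE TABLE "text" (old_id INTEGER, old_text TEXT);
"""


def build_demo_db():
    """Create a tiny in-memory database with one redirect and one article."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?, 0)",
                     [(1, "A", 10), (2, "B", 20)])
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?)",
                     [(10, 1, 100), (20, 2, 200)])
    conn.executemany('INSERT INTO "text" VALUES (?, ?)',
                     [(100, "#REDIRECT [[B]]"), (200, "Some article text.")])
    return conn


def flag_redirects(conn):
    """Set page_is_redirect for pages whose latest text starts with #REDIRECT."""
    cur = conn.execute(
        """
        UPDATE page
           SET page_is_redirect = 1
         WHERE page_latest IN (
               SELECT rev_id
                 FROM revision
                 JOIN "text" ON rev_text_id = old_id
                WHERE old_text LIKE '#REDIRECT%')
        """
    )
    conn.commit()
    return cur.rowcount  # number of pages that were flagged
```

On a real MySQL database the same idea would be written as a single UPDATE joining the three tables. Take a backup first, since this changes every matching page row.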
patch for xml2sql-0.5
--- xml2sql-0.5/keywords	2005-07-30 15:17:42.000000000 +0900
+++ mywork/keywords	2012-03-08 12:56:37.000000000 +0900
@@ -20,6 +20,7 @@
+  el_redirect,
 struct eltmap { char *name; enum element t; };
@@ -44,3 +45,4 @@
 minor,        el_minor
 comment,      el_comment
 text,         el_text
+redirect,     el_redirect
--- xml2sql-0.5/xml2sql.c	2006-02-08 19:34:14.000000000 +0900
+++ mywork/xml2sql.c	2012-03-08 13:18:02.000000000 +0900
@@ -749,9 +749,12 @@
 		page.lasttid = textid;
 		page.lastts = strdup(revision.timestamp);
 		if(revision.text) {
 			page.redir =
 				strncasecmp(revision.text, "#REDIRECT", 9) == 0 &&
 				strstr(revision.text, "[[");
+			page.redir = 0;
 			page.len = strlen(revision.text);
@@ -1022,6 +1025,10 @@
 			md5(revision.text, revision.md5);
+	case el_redirect:
+		if(!page.skip && tlen) page.redir = 1;
+		break;
 	current = elstack[--elidx];

--Saoyagi2 (talk) 04:43, 8 March 2012 (UTC)
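For readers who do not want to rebuild the C program, the idea behind the patch (mark a page as a redirect when a <redirect> element is seen, instead of guessing from the text) can be sketched in a few lines of Python. This is not part of xml2sql. The function names and the sample document are invented for illustration, and the namespace URI in the sample is only an example, since the code strips whatever namespace is present.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample in the general shape of a pages-articles dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Foo</title>
    <redirect />
    <revision><text>#REDIRECT [[Bar]]</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><text>Some article text.</text></revision>
  </page>
</mediawiki>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, is_redirect) for each <page>, streaming through the XML."""
    title = None
    is_redirect = False
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "redirect":
            # Same idea as the el_redirect case in the patch: the element
            # itself, not the wikitext, marks the page as a redirect.
            is_redirect = True
        elif name == "page":
            yield title, is_redirect
            title, is_redirect = None, False
            elem.clear()  # free the finished page to keep memory use low
```

For example, list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8")))) walks the two sample pages and reports only Foo as a redirect.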

DiscussionThreading tag


I got one more error message from xml2sql today, about the unexpected element <DiscussionThreading>.

So I have modified my Perl script xml2sql_helper.pl, which should be run before xml2sql. The script now:

  • Deletes the <redirect /> lines in the text file.
  • Deletes <DiscussionThreading>\n...\n</DiscussionThreading>.

Best regards. -- Andrew Krizhanovsky 13:15, 2 September 2010 (UTC)
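The two deletions listed above can be sketched as a small line filter. This is not the actual xml2sql_helper.pl, whose source is not shown here; it is a minimal Python approximation, and the function name strip_unsupported is made up. It assumes, as the grep -v approach does, that each <redirect /> sits on its own line and that the DiscussionThreading block starts and ends on separate lines.

```python
def strip_unsupported(lines):
    """Yield the input lines without <redirect /> lines and without
    <DiscussionThreading>...</DiscussionThreading> blocks."""
    in_block = False
    for line in lines:
        if in_block:
            # Inside a DiscussionThreading block: drop everything up to
            # and including the line with the closing tag.
            if "</DiscussionThreading>" in line:
                in_block = False
            continue
        if "<DiscussionThreading>" in line:
            # Drop the opening line; stay in the block unless it is
            # closed on the same line.
            if "</DiscussionThreading>" not in line:
                in_block = True
            continue
        if "<redirect />" in line:
            continue
        yield line
```

To use it between bzcat and xml2sql, one could feed it sys.stdin and write the surviving lines to sys.stdout. Because it works line by line, it never holds the whole dump in memory.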

namespace tag


Hi, unfortunately the tool also ignores the ns tag, but I am interested in knowing the namespaces. Is there any chance of getting the source code for xml2sql?



The links to the xml2sql binary files no longer work, so you can use these files instead (I found them on my old computer :):