Talk:WikiXRay/Python parser


Also generating text.sql?

Question: is there any chance that you could also generate SQL for the text table (as an extra option)?

It would be nice if the WikiXRay Python parser could become an alternative to Xml2sql (which generates only the "big three" tables).

I also wonder whether the generated SQL uses extended inserts; if so, it would be nice to control the (preferred minimum) SIZE of each insert with a parameter (as opposed to the number of inserts). Regards, // RIL 16:21, 6 July 2007 (UTC)

--

Sure. I'm now working on what I call the standard version, with no processing of the number of letters and words, which simply copies the text of every revision to another file, text.sql.
It's pretty straightforward to write the output to a file text.sql. The problem is that the uncompressed English Wikipedia is far beyond 700 GB, so I'm trying to write the output directly to standard output, letting you pipe the INSERTs straight into MySQL instead of writing a huge SQL file.
My problem now is how to write to stdout with the UTF-8 codec I already use for files.
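Something like the following might work: wrap stdout in the same codecs writer already used for the file output (just an untested sketch; the INSERT line is illustrative):

import codecs
import sys

# Wrap stdout with a UTF-8 StreamWriter, mirroring the codec setup
# already used for file output, so unicode text can be piped to MySQL.
utf8_out = codecs.getwriter('utf-8')(sys.stdout)
utf8_out.write(u"INSERT INTO text VALUES (1, 'revision text...', '');\n")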
About extended inserts: I didn't know they existed, and I can't find an example of their syntax; if they are faster, they could be really interesting.
Could anyone provide a model for writing extended INSERTs for MySQL?
--GlimmerPhoenix 12:10, 7 July 2007 (UTC)

--

Extended INSERTs are very much faster (orders of magnitude) and have a very simple syntax:
  • INSERT INTO tbl_name VALUES (field1,'field2','field3'),(2,'abc','def'),(3,'ghi','jkl'),(...);
See also http://dev.mysql.com/doc/refman/5.0/en/insert.html
Example from the Swedish WP:
  • INSERT INTO `sv_text` VALUES (727175, '''''''Jorum'''''' är ett...', ''),(727184, '#REDIRECT [[Carl Sandberg]]', ''),(727187, '''''''Hemtextilier'''''' är...', ''),(727190, '''''''Textilt konsthantverk'''''' benämns...', ''),(...), etc.;
Default max size: 156K (256K is often the upper limit most tools[1] accept for upload to hosted servers).
Default max number of inserts: 50,000 (the default in phpMyAdmin etc.).
Regarding uploads to hosted servers: it is of special value that the size of each row (each extended INSERT) can be specified (although I don't use hosted servers myself). PHP scripts for uploading huge SQL files (such as BigDump[2], which BTW is one of the best) choke if the inserts become too big; a single big article amongst many small ones means that your extended INSERT can become way too big. For example, Xml2sql[1] allows specifying only the number of inserts, a limitation which means that XML dumps containing a big article (typically > 300K) cause MySQL to time out.
Parser logic for extended INSERTs (a rough sketch follows the list): always allow ONE insert (regardless of size), and continue appending more inserts (only) if a lookahead size check returns:
  1. <= MaxSize and
  2. <= MaxNumberOfInserts.
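In Python, that batching rule could look roughly like this (a sketch only, not the parser's actual code; names and defaults are illustrative, and MaxNumberOfInserts is read here as the number of rows per statement):

def build_extended_inserts(rows, max_size=156 * 1024, max_rows=50000):
    # rows: iterable of already-escaped "(v1,'v2',...)" value strings
    statements = []
    current = []
    current_size = 0
    for row in rows:
        too_big = current and current_size + len(row) + 1 > max_size
        too_many = len(current) >= max_rows
        if too_big or too_many:
            statements.append('INSERT INTO tbl_name VALUES %s;' % ','.join(current))
            current = []
            current_size = 0
        # the first row of each statement is always accepted, however large
        current.append(row)
        current_size += len(row) + 1  # +1 for the comma separator
    if current:
        statements.append('INSERT INTO tbl_name VALUES %s;' % ','.join(current))
    return statements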
Other optimizations: as for me, I would have use for two separate SQL files with optimization statements (such SQL should of course be kept out of the big SQL dumps, so that using them stays optional); they speed up bulk loading by deferring key and constraint maintenance until after the import. I always use the following:
enable_optimization.sql
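-- run BEFORE the bulk import: defer index and constraint maintenance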
SET AUTOCOMMIT=0;
SET UNIQUE_CHECKS=0;
SET FOREIGN_KEY_CHECKS=0;
ALTER TABLE `page` DISABLE KEYS;
ALTER TABLE `revision` DISABLE KEYS;
ALTER TABLE `text` DISABLE KEYS;
disable_optimization.sql
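-- run AFTER the bulk import: restore checks and rebuild the indexes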
SET UNIQUE_CHECKS=1;
SET FOREIGN_KEY_CHECKS=1;
ALTER TABLE `page` ENABLE KEYS;
ALTER TABLE `revision` ENABLE KEYS;
ALTER TABLE `text` ENABLE KEYS;
COMMIT;
A typical command file (on Windows) would then look something like:
drive:\path\mysql.exe -u name --password=notell db_name < drive:\path\enable_optimization.sql
drive:\path\mysql.exe -u name --password=notell db_name < drive:\path\page.sql
drive:\path\mysql.exe -u name --password=notell db_name < drive:\path\revision.sql
drive:\path\mysql.exe -u name --password=notell db_name < drive:\path\text.sql
drive:\path\mysql.exe -u name --password=notell db_name < drive:\path\disable_optimization.sql
Implementing all the above would make the parser very useful for a wide range of users. For Wikipedia mirrors, one would also have use for the following (a sketch of items 1 and 2 follows the list):
  1. Stripping out entire namespaces (via a skip-list of namespace names; --skipnamespaces=name1,name2,...).
  2. Injection of an optional string (--inject=) at the very start of the article text. Example:
    <text xml:space="preserve">insert a param-string here\nThe rest of the article...</text>
  3. Generating SQL (as above) OR importing directly into MySQL.
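A hypothetical Python sketch of items 1 and 2 (the option values and helper names are made up for illustration):

skip_namespaces = set('name1,name2'.split(','))  # from --skipnamespaces=name1,name2
inject_string = 'insert a param-string here\n'   # from --inject=

def keep_page(title):
    # drop a page whose namespace prefix is on the skip-list (item 1)
    prefix, sep, rest = title.partition(':')
    return not (sep and prefix in skip_namespaces)

def injected_text(text):
    # prepend the injected string at the very start of the article text (item 2)
    return inject_string + text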
Regards, RIL 16:10, 7 July 2007 (UTC)

--

Ok, I'm going to undertake the above suggestions for both versions. Give me a little bit of time to implement them properly. Stay tuned... (I will announce the release both on the WikiXRay_Python_Parser page and on the Wikitech-l mailing list.)
Thank you very much for your detailed suggestions.
--GlimmerPhoenix 10:24, 8 July 2007 (UTC)

Dump tools mentioned

  1. xml2sql - a tool for converting an XML dump to an SQL dump
  2. BigDump - a small PHP script for importing very large MySQL dumps (even through web servers with hard runtime limits or safe mode!)