WikiTeam/Dumpgenerator rewrite

From Meta, a Wikimedia project coordination wiki

dumpgenerator.py is too unreliable and should use fewer custom hacks; we should rewrite it.[1] Here are some specifications for what's needed. See also a previous hackish attempt.[2]

Feel free to edit and comment!

  • The current idea, proposed by Betacommand, is to rely only on current PWB API+screenscraping features, which should work for MediaWiki 1.10+; all older releases would be supported only by the old dumpgenerator. A rough estimate on a couple hundred wikis suggests PWB can handle 90% of them. Implementation aspects to verify on PWB:
    • non-API: there's no support for screen scraping in core at the moment (and last time it was checked, it was broken in compat, too) but in core, it should be relatively easy to add, by creating a new type of Site object ('ScreenscrapingSite' instead of 'APISite');
    • API XML export support [3];
    • httplib2, current external dependency (possibly replaced in the future by Requests):
      • handling redirects and weird status codes,
      • setting a custom user agent,
      • multithreading,
      • customising delay between requests also based on status codes (or lack thereof);
    • PWB API confirmed misc advantages:
      • fix [4]: if the api call uses api.CachedRequest, it will write to the disk.
    • API nice to have:
      • meta=siteinfo for extensions and everything [5],
      • logs,
      • user and file metadata.
  • Take as input:
    • URL to index.php or api.php, or both if needed (both must be known, but in current MediaWiki the API reports the index.php URL and index.php pages include the api.php URL),
    • choice of namespaces and if dumping text/images/both,
    • choice of current revisions/complete history
    • [optional] list of titles
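
The inputs above could be exposed on the command line roughly as follows. This is only a sketch: the option names (`--api`, `--index`, `--xml`, `--images`, `--curonly`, `--namespaces`, `--titles`) and the helper `parse_options` are assumptions made for illustration, not an agreed interface.

```python
import argparse


def parse_options(argv):
    """Parse hypothetical command-line options for the rewritten dumper."""
    parser = argparse.ArgumentParser(description="Dump a MediaWiki wiki")
    parser.add_argument("--api", help="URL to api.php")
    parser.add_argument("--index", help="URL to index.php")
    parser.add_argument("--xml", action="store_true", help="dump page text")
    parser.add_argument("--images", action="store_true", help="dump files")
    parser.add_argument("--curonly", action="store_true",
                        help="current revisions only (default: full history)")
    parser.add_argument("--namespaces", default="all",
                        help="comma-separated namespace numbers, or 'all'")
    parser.add_argument("--titles", help="optional file with one title per line")
    args = parser.parse_args(argv)

    # At least one entry point is needed; the other can be discovered later.
    if not (args.api or args.index):
        parser.error("provide --api, --index, or both")
    # The user must ask for text, files, or both.
    if not (args.xml or args.images):
        parser.error("provide --xml, --images, or both")
    if args.namespaces != "all":
        args.namespaces = [int(n) for n in args.namespaces.split(",")]
    return args
```

For example, `parse_options(["--api", "http://www.domain.org/rootdir/api.php", "--xml", "--curonly", "--namespaces", "0,1"])` selects the current revisions of namespaces 0 and 1.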
  • Produce a list of any and all pages in the wiki,
    • with the possibility to exclude some namespaces,
    • falling back to Special:AllPages when API is not available,
    • [highly preferred] handling weird screenscraping failures[6].
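
As an illustration of the API path of the page listing, here is a minimal sketch built on `list=allpages`. It assumes a MediaWiki release that returns a `continue` block in the response; older releases use a different `query-continue` format, which this sketch does not handle, and the Special:AllPages fallback is not covered either. `api_get` is a hypothetical callable that sends the parameters to api.php and returns the decoded JSON.

```python
def iter_titles(api_get, namespaces=(0,)):
    """Yield every page title in the given namespaces.

    api_get is a hypothetical callable: it takes a dict of API parameters,
    performs the request and returns the decoded JSON response as a dict.
    """
    for ns in namespaces:
        params = {
            "action": "query",
            "list": "allpages",
            "apnamespace": ns,
            "aplimit": "max",
            "format": "json",
        }
        while True:
            data = api_get(params)
            for page in data["query"]["allpages"]:
                yield page["title"]
            if "continue" not in data:
                break  # last batch for this namespace
            # Merge the continuation values into the next request.
            params = dict(params, **data["continue"])
```

Passing the request function in keeps the listing logic independent of the HTTP library, whether that ends up being httplib2, Requests or PWB's own request layer.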
  • Download the XML of all selected pages in a single XML file.
    • Last revision or full history, depending on user's choice.
    • Using Special:Export, or ensuring that the API XML has the very same format (formats changed a lot across releases). Example XML.
    • Handle pages that are deleted or otherwise run into problems while the dump is in progress.
    • Handle big histories:
      • with more than 1000 revisions, or whatever the export limit on the wiki is, up to tens of thousands of revisions;
      • [optional, low priority] choose best behaviour depending on the available amount of RAM;
      • when there's only Special:Export and it doesn't accept parameters to export older revisions[7];
      • where the size of all revisions combined amounts to several GB;
      • verifying that the end of the history has really been reached and that the XML file is valid (bugzilla:29961).
    • [optional] Handle pagename and aliases/URL rewrite issues (including namespaces).[8]
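
A well-formedness check on the resulting file could look like the sketch below. It streams the XML so that multi-GB histories do not have to fit in RAM, and counts `<page>` elements so the total can be compared with the title list. The function name `check_dump` is hypothetical, and this only tests that the file is well-formed XML; it does not validate against the MediaWiki export schema.

```python
import xml.etree.ElementTree as ET


def check_dump(source):
    """Return (is_well_formed, pages_seen) for a dump file or file object."""
    pages = 0
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            # Tags carry the export namespace, e.g. "{http://...}page".
            if elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                elem.clear()  # drop the page's text and children to limit memory use
    except ET.ParseError:
        # Truncated or otherwise broken XML ends up here.
        return False, pages
    return True, pages
```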
  • List all files on the wiki (if images dump selected):
    • get from Special:ListFiles if no API available;
    • handle Special:ListFiles problems with big requests (by scaling down to 50 files listed at once, or less);
    • [optional] handle Special:ListFiles offset weirdnesses[9] and other HTML peculiarities [10];
    • [highly preferred] save both file description pagename and file path.
  • Download all listed files:
    • originals (however big they are and in all formats),
    • file description for copyright reasons etc.,
    • save file description as XML of the page alongside each file,
    • handle filename problems and in particular filenames which are too long by cropping in some way, e.g. first 100 chars + 32 (md5sum) + ext;
    • [optional, lowest priority] handle saving files in crappy Windows FS.
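
The cropping rule suggested above (first 100 chars + md5sum + extension) might be sketched like this. Two details are assumptions, since the list does not specify them: the md5 is taken over the full original filename, and a name is cropped only when it is longer than the cropped form would be. `crop_filename` is a hypothetical name.

```python
import hashlib
import os


def crop_filename(filename, max_stem=100):
    """Shorten an overlong filename to: stem[:max_stem] + md5 hex + extension."""
    stem, ext = os.path.splitext(filename)
    # Assumption: leave the name alone unless cropping makes it shorter.
    if len(filename) <= max_stem + 32 + len(ext):
        return filename
    # Assumption: hash the full original name so distinct files stay distinct.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return stem[:max_stem] + digest + ext
```

Hashing the original name keeps two long filenames that share their first 100 characters from colliding after cropping.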
  • Save information about the wiki.
    • Special:Version in HTML format, for release info, list of extensions, custom tags and whatever (the XML could be useless otherwise).
    • [highly preferred] Also main page in HTML format, to have a basic example of the look and description of the wiki, plus info stored only in the footer by old releases or misconfigured wikis, like copyright info (cf. mw:Manual:$wgRightsText and related pages).
    • [optional, low priority] Other pages in HTML like Project:About etc. if available.
  • [highly preferred] Save stuff in the same format as before,[11][12] i.e. domainorg_rootdir-20110520-wikidump/ directory with
    • images/ subdir for all files,
    • domainorg_rootdir-20110520-(history-txt|titles.txt|history.xml),
    • other stuff;
    • [required] or another format which is URL-friendly and contains both domain and date info;
    • [required] and by automatically creating the directory if no custom/resumable path is specified in the command.
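
For illustration, the directory name in the format above can be derived from the wiki URL and the dump date as follows. The normalisation steps are an assumption about how a name like domainorg_rootdir is obtained, and `dump_dirname` is a hypothetical helper.

```python
import re
from datetime import date


def dump_dirname(url, day):
    """Build a name like domainorg_rootdir-20110520-wikidump from a wiki URL."""
    prefix = url.lower()
    prefix = re.sub(r"^https?://", "", prefix)          # drop the scheme
    prefix = re.sub(r"^www\.", "", prefix)              # drop a leading www.
    prefix = re.sub(r"/(index|api)\.php$", "", prefix)  # drop the entry point
    prefix = prefix.replace(".", "")                    # domain.org -> domainorg
    prefix = re.sub(r"[^a-z0-9]", "_", prefix)          # keep it URL-friendly
    return "%s-%s-wikidump" % (prefix, day.strftime("%Y%m%d"))
```

Calling `os.makedirs(name, exist_ok=True)` on the result would cover the automatic creation of the directory when no custom path is given.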
  • [optional, low priority] Download all the wiki's logs.
  • Handle random failures (retry many times and eventually die to let us go on).
    • Slow servers timing out without notice.
    • Special:Export or API export consistently or occasionally failing for DB problems or whatever.
    • other sources of invalid XML, HTTP errors etc. like
      • IOError: [Errno socket error] [Errno -2] Name or service not known
      • httplib.IncompleteRead: IncompleteRead
      • [less common] IOError: ('http protocol error', 0, 'got a bad status line', None)
      • [less common] IOError: [Errno 2]
    • Follow HTTP/HTTPS temporary or permanent redirects from the URL provided in command line.
    • Integrity check of the final XML (as XML and as MediaWiki dump[13]) when the download completes or dies.
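
The "retry many times and eventually die" behaviour could be sketched as a small wrapper with exponential backoff. The name `with_retries`, the default number of tries and the delays are assumptions; the default exception types are meant to cover errors like the IOError and IncompleteRead examples listed above, as they appear in Python 3.

```python
import http.client
import time


def with_retries(fetch, max_tries=5, base_delay=1.0,
                 errors=(OSError, http.client.HTTPException),
                 sleep=time.sleep):
    """Call fetch() until it succeeds; re-raise after max_tries failures."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except errors:
            if attempt == max_tries:
                raise  # give up so that a launcher can move on to the next wiki
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be replaced in tests.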
  • Allow resuming download:
    • even if the history XML is now invalid (some page truncated at the end etc.);
    • [optional] by guessing the path to the previous dump;
    • [optional] automatically triggered when failing for temporary errors;
    • [optional] automatically triggered when the resulting XML is invalid.
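
One possible way to resume from a history XML whose last page was truncated is to cut the file back to its last complete `</page>` and continue appending from there. The sketch below scans backwards in blocks so that multi-GB files are not read into memory; `truncate_to_last_page` is a hypothetical helper, and finding which title to resume from is left out.

```python
import os


def truncate_to_last_page(path, block=1 << 20):
    """Cut the file after its last complete </page>; return the new size.

    Returns None (and leaves the file untouched) if no </page> is found.
    A literal "</page>" in page text is escaped in the export, so the
    marker should only match real element ends.
    """
    marker = b"</page>"
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            start = max(0, pos - block)
            f.seek(start)
            # Append the first bytes of the block read previously, so a
            # marker that straddles two blocks is still found.
            chunk = f.read(pos - start) + tail[:len(marker) - 1]
            idx = chunk.rfind(marker)
            if idx != -1:
                end = start + idx + len(marker)
                f.truncate(end)
                return end
            tail = chunk
            pos = start
    return None
```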
  • Spider options
    • Possibility to add a delay/wait time for friendlier crawling (pwb: done)
    • Possibility to fake the User-Agent (pwb: done)
    • [optional] Remove the downloader's IP from MediaWiki's HTML comments
    • [optional] Possibility to login for private or otherwise restricted wikis (pwb: done)
    • [optional] If one tries to download WMF wikis, die and tell the user about the data dumps site
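
The WMF check in the last item could be as simple as the sketch below. The domain list is illustrative and not exhaustive, and `is_wmf_wiki` is a hypothetical name; a caller would stop and point the user to dumps.wikimedia.org when it returns True.

```python
from urllib.parse import urlsplit

# Illustrative, non-exhaustive list of Wikimedia Foundation project domains.
WMF_DOMAINS = (
    "wikipedia.org", "wiktionary.org", "wikibooks.org", "wikinews.org",
    "wikiquote.org", "wikisource.org", "wikiversity.org", "wikivoyage.org",
    "wikimedia.org", "wikidata.org", "mediawiki.org",
)


def is_wmf_wiki(url):
    """Return True if the URL's host is one of the listed domains or a subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in WMF_DOMAINS)
```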
  • [optional, low priority] Leave some possibility to expand with download support for other wiki engines.
  • [optional, low priority] Rewrite also launcher.py and uploader.py scripts to:
    • mass-download hundreds/thousands of wikis with a single command;
    • compress[14] and upload[15] dump to archive.org with metadata via IA pseudo-S3 API[16][17] via boto[18] or other.