User:This, that and the other/interwiki

From Meta, a Wikimedia project coordination wiki

Counting interwiki links[edit]

To obtain a count of how many pages use each interwiki prefix (only those greater than 4 characters, in order to exclude the very common "w:" etc) on each Wikimedia wiki, do the following:

  1. Request a Tool Labs account if you don't already have one; see wikitech:Nova Resource:Tools
  2. Download the file http://noc.wikimedia.org/conf/all.dblist and scp it over to your home directory on Tool Labs:

    scp /path/to/all.dblist username@tools-login.wmflabs.org:~/all.dblist

  3. Log into Tool Labs:

    ssh username@tools-login.wmflabs.org

  4. Execute this mongrel of a command at the terminal:

    colnames=; while read p; do echo $p; mysql --defaults-file=~/replica.my.cnf -B $colnames -h ${p}.labsdb -e "select database(), iwl_prefix as prefix, count(1) as count from iwlinks where length(iwl_prefix) > 4 and iwl_prefix <> 'commons' group by iwl_prefix order by iwl_prefix" ${p}_p >> iwcount; colnames=-N; done < all.dblist

    The server takes a very long time to get through commonswiki, because there are over 84 million entries in its interwiki links table. If you want to skip Commons (or any other wiki that the script is stuck on), just press Ctrl+C and the script will move on. If you don't mind waiting, the script will get through the big wikis after a couple of minutes or so without a problem.

    You will see ERROR 2005 when the script tries to access private wikis (these databases are obviously not available on Tool Labs). You can just ignore these errors.

  5. The file iwcount now contains a count of the different interwiki prefixes, in tab-separated-values format. Exit out of your remote terminal (logout), and use scp again to download it:

    scp username@tools-login.wmflabs.org:~/iwcount /path/to/iwcount

You can import the resulting file into Excel or another spreadsheet program. Excel's PivotTable function is incredibly useful for manipulating the data into different formats.

Listing uses of a single specific interwiki[edit]

Follow the above steps, but run this command instead:

prefixname=meta; colnames=; while read p; do echo $p; mysql --defaults-file=~/replica.my.cnf -B $colnames -h ${p}.labsdb -e "select database(), iwl_prefix as prefix, concat(iwl_prefix,':',iwl_title) as iw_link_text, page_namespace as ns, page_title as title from iwlinks inner join page on iwl_from=page_id where iwl_prefix = '${prefixname}' order by iwl_title" ${p}_p >> iw_${prefixname}; if [ -s "iw_${prefixname}" ]; then colnames=-N; fi; done < all.dblist

Set the prefixname variable at the start to the prefix you are interested in (must be lowercase). Leave colnames= blank. A list of usages will be written to the file iw_prefix, e.g. iw_meta in the example above.

Or use PiRSquared17's tool: toollabs:pirsquared/iw.php.

Listing uses of several interwikis at once[edit]

Follow the above steps, but run this command instead:

prefixes="'meta','schoolswp','wikia','wmf'"; colnames=; while read p; do echo $p; mysql --defaults-file=~/replica.my.cnf -B $colnames -h ${p}.labsdb -e "select database(), iwl_prefix as prefix, concat(iwl_prefix,':',iwl_title) as iw_link_text, page_namespace as ns, page_title as title from iwlinks inner join page on iwl_from=page_id where iwl_prefix in (${prefixes}) order by iwl_prefix, iwl_title" ${p}_p >> iwlist; if [ -s "iwlist" ]; then colnames=-N; fi; done < all.dblist

Prefixes must be lowercase. Note that each prefix and is enclosed in single quotes, while the whole list of prefixes is surrounded by double quotes. Again, leave colnames= blank. The output is written to a file called iwlist (note that this filename appears twice in the above command).