User:Mutante/mw-dupes

From Meta, a Wikimedia project coordination wiki

dupe detection in mediawiki stats table[edit]

create a dupe table in mw syntax from mysql[edit]

In one step: use an INNER JOIN on itself to detect and list name duplicates and output in mw table syntax style:

SELECT DISTINCT "|",m.id,m.si_sitename,m.name,m.total,m.statsurl INTO OUTFILE '/tmp/mw-dupes' FIELDS TERMINATED BY '\n| ' LINES TERMINATED BY '\n|- \n'
FROM mediawikis m INNER JOIN mediawikis n ON m.si_sitename = n.si_sitename WHERE m.id <> n.id AND m.good=n.good AND m.total=n.total AND m.users=n.users ORDER BY m.si_sitename;

Then copy from /tmp to /var/www/ if ok, chown to www-data, and be able to copy / paste from wikistats page.

(Just need to paste between manually created table head [http://wikistats.wmflabs.org/mw-dupes ]).


style rules for dupe handling / which to keep[edit]

  1. a wiki is a real duplicate only if good,total,users AND si_sitename are identical. some may have identical names but are actually different.
  1. domain names win over raw IP addresses
  2. if it works: plain domain.org wins vs. wiki.domain.org, www.domain.org, etc. shortest first
    1. otherwise: wiki.domain.org wins vs. www.domain.org, but www. is better than a specific server name in a cluster
  3. if exists: domain.org wins vs. domain.com and other TLDs
    1. otherwise: domain.com wins vs. domain.net, domain.info, country TLDs.
  4. a domain name that includes the "si_sitename" name or is closer to it wins vs. alternate domain names
  5. URLs without port (standard 80), win over URLs with :8080, :8081, :81 etc.. do not include :80 either.
  6. http wins over https (we can remove/auto-update protocol later if wanted)
  7. if still in doubt, keep the shorter URL

table of dupes[edit]

currently clean.

id name si_sitename total url