User:Mutante/mw-dupes
dupe detection in mediawiki stats table[edit]
create a dupe table in mw syntax from mysql[edit]
In one step: use an INNER JOIN on itself to detect and list name duplicates and output in mw table syntax style:
SELECT DISTINCT "|",m.id,m.si_sitename,m.name,m.total,m.statsurl INTO OUTFILE '/tmp/mw-dupes' FIELDS TERMINATED BY '\n| ' LINES TERMINATED BY '\n|- \n' FROM mediawikis m INNER JOIN mediawikis n ON m.si_sitename = n.si_sitename WHERE m.id <> n.id AND m.good=n.good AND m.total=n.total AND m.users=n.users ORDER BY m.si_sitename;
Then copy from /tmp to /var/www/ if ok, chown to www-data, and be able to copy / paste from wikistats page.
(Just need to paste between manually created table head [http://wikistats.wmflabs.org/mw-dupes ]).
style rules for dupe handling / which to keep[edit]
- a wiki is a real duplicate only if good,total,users AND si_sitename are identical. some may have identical names but are actually different.
- domain names win over raw IP addresses
- if it works: plain domain.org wins vs. wiki.domain.org, www.domain.org, etc. shortest first
- otherwise: wiki.domain.org wins vs. www.domain.org, but www. is better than a specific server name in a cluster
- if exists: domain.org wins vs. domain.com and other TLDs
- otherwise: domain.com wins vs. domain.net, domain.info, country TLDs.
- a domain name that includes the "si_sitename" name or is closer to it wins vs. alternate domain names
- URLs without port (standard 80), win over URLs with :8080, :8081, :81 etc.. do not include :80 either.
- http wins over https (we can remove/auto-update protocol later if wanted)
- if still in doubt, keep the shorter URL
table of dupes[edit]
currently clean.
id | name | si_sitename | total | url | |
---|---|---|---|---|---|
- pasted from http://wikistats.wmflabs.org/mw-dupes