Talk:Spam blacklist

This is an archived version of this page, as edited by MER-C (talk | contribs) at 10:30, 9 June 2008 (→‎cccb.org: update oldid). It may differ significantly from the current version.

Shortcut:
WM:SPAM
The associated page is used by the MediaWiki Spam Blacklist extension, and lists strings of text that may not be used in URLs on any page in Wikimedia Foundation projects (as well as on many external wikis). Any Meta administrator can edit the spam blacklist. There is also a more aggressive way to block spamming through direct use of $wgSpamRegex. Only developers can make changes to $wgSpamRegex, and its use is to be avoided whenever possible.

For more information on what the spam blacklist is for, and the processes used here, please see Spam blacklist/About.

Please post comments in the appropriate section below: Proposed additions, Proposed removals, or Troubleshooting and problems; read the message boxes at the top of each section for an explanation. Also, please check back some time after submitting, as there could be questions regarding your request. Per-project whitelists are discussed at MediaWiki talk:Spam-whitelist. In addition to that, please sign your posts with ~~~~ after your comment. For other discussions related to this page that are not about a problem with a particular link, please see Spam blacklist policy discussion.

Completed requests are archived (list, search); additions and removals are logged.

snippet for logging: {{/request|1034749#{{subst:anchorencode:section name here}}}}

If you cannot find your remark below, please do a search for the URL in question with this Archive Search tool.

Spam that only affects a single project should go to that project's local blacklist

Proposed additions

This section is for proposing that a website be blacklisted; add new entries at the bottom of the section, using the basic URL so that there is no link (example.com, not http://www.example.com). Provide links demonstrating widespread spamming by multiple users. Completed requests will be marked as done or denied and archived.

viartis.net

I thought this troublesome domain had been deep-blacklisted long ago; apparently not:

Background

Account (this time):



--A. B. (talk) 02:51, 1 June 2008 (UTC)Reply

A. B., next time you can just add this to the report at 'User:COIBot/XWiki/domain' (User:COIBot/XWiki/viartis.net in this case). There is a comment in there; any discussion below that comment is not deleted, and is retained if the report is regenerated.
In all cases, Added. --Dirk Beetstra T C (en: U, T) 19:04, 1 June 2008 (UTC)Reply

mybuys.com

I spotted a block & page deletions here (here). There was one link in the top 10 search results on en wp, which I have removed. I do not see this as a valuable link for the project. It is more a case of potential spam than actual spam, so other views are welcome.




Thanks --Herby talk thyme 07:27, 2 June 2008 (UTC)Reply

The link has not been added as far back as the linkwatchers' database goes (and is complete). I think we should not blacklist things when there is no abuse, but there is nothing wrong with monitoring it. --Dirk Beetstra T C (en: U, T) 13:04, 2 June 2008 (UTC)Reply
Agreed - it is not useful, but let's wait until it is abused. I will (try to) add it to the bots if not already done.  – Mike.lifeguard | @en.wb 01:15, 3 June 2008 (UTC)Reply

More satanismresource spam

satanismresource.fortunecity.com redirects to blacklisted domain geocities.com/satanismresource.

See Talk:Spam blacklist/Archives/2008/05#geocities.com/satanismresource





--A. B. (talk) 03:07, 3 June 2008 (UTC)Reply


Added --A. B. (talk) 03:26, 3 June 2008 (UTC)Reply


HedgeLender LLC spam

Spam domains




Related domains


















Accounts










Reference

--A. B. (talk) 04:52, 3 June 2008 (UTC)Reply

I agree this is cross-wiki spam and warrants listing. I think in this case it may be worthwhile to add the related domains too. Or just the ones spammed?  – Mike.lifeguard | @en.wb 14:28, 3 June 2008 (UTC)Reply
I didn't have time to blacklist myself and I'm in meetings all day today. Yes, I think they should all be blacklisted. --A. B. (talk) 17:17, 3 June 2008 (UTC)Reply
Added  – Mike.lifeguard | @en.wb 01:49, 4 June 2008 (UTC)Reply

Tangram software seller from spain



Has been spamming all Tangram pages wiki-wide since April to sell his software. Uses dynamic Spanish IP addresses - see here for the IP numbers used up until tonight. TIA and kind regards, MoiraMoira 18:28, 3 June 2008 (UTC)Reply

Agreed and blacklisted. --Erwin(85) 20:18, 3 June 2008 (UTC)Reply
Spam domains











Related domains











Google Adsense: 6158286478265594
There appear to be many more related domains


Spam accounts







Reference

--A. B. (talk) 04:01, 5 June 2008 (UTC)Reply

Added --A. B. (talk) 04:25, 5 June 2008 (UTC)Reply

Gallomedia spam

Spam domain



Related domains





















Accounts



Reference

--A. B. (talk) 04:11, 5 June 2008 (UTC)Reply

Added --A. B. (talk) 04:26, 5 June 2008 (UTC)Reply

supermodels.nl



Although this site has been blacklisted, there are still plenty of links on Wikipedia, and it would be great if the removal could be done by a bot. According to Finjan Secure Browsing ([see this screenshot]), AVG and several board threads, this site is infested with malware and badware. Besides that, this site is inaccessible at times, and otherwise it often just keeps loading and loading. Robomod 20:58, 7 June 2008 (UTC)Reply

This site is not blacklisted at meta, but at enwiki. Should we consider adding it here? I will take a look at removing some links now.  – Mike.lifeguard | @en.wb 21:10, 7 June 2008 (UTC)Reply

Agreed in a sense - however, if it contains malware then prevention is better than cure - we have listed on that basis in the past. Added for now - it can always be removed when the problem is clarified/dealt with - cheers --Herby talk thyme 06:52, 9 June 2008 (UTC)Reply

cccb.org



Spammers


Massive spam page creation and linkspamming across several projects from a confessed paid editor. See w:WT:WPSPAM#spam.cccb.org (permanent link).

I'd like a steward to deal with this request in order to unify and lock the account. MER-C 09:49, 9 June 2008 (UTC)Reply

Proposed additions (Bot reported)

This section is for websites which have been added to multiple wikis as observed by a bot.

Items there will automatically be archived by the bot when they get stale.

Sysops, please change the LinkStatus template to closed ({{LinkStatus|closed}}) when the report is dealt with, and change to ignore for good links ({{LinkStatus|ignore}}). More information can be found at User:SpamReportBot/cw/about

These are automated reports; please check the records and the links thoroughly, as they may be good links! For some more info, see Spam blacklist/help#SpamReportBot_reports

If the report contains links to fewer than 5 wikis, then only add it when it is really spam. Otherwise just revert the link additions and close the report; closed reports will be reopened if spamming continues.

Please place suggestions on the automated reports in the discussion section.

SpamReportBot is currently offline

For those of us who feel like pruning some old data: Category:Open XWiki reports. That category contains the reports with {{LinkStatus|Open}}. The open reports from User:COIBot are listed below.

Running; it will report a certain domain shortly after a link is used more than 2 times by one user on more than 2 wikipedias (technically: when more than 66% of the additions of this link were made by this user, and more than 66% of them were made cross-wiki). Same system as SpamReportBot (discussions go after the remark "<!-- Please put comments after this remark -->" at the bottom; please close reports when reverted/blacklisted/waiting for more, or mark them ignore when the link is good).

Open report: vrsystems.ru - last update 2023-06-27 15:51:16 by COIBot - site IP 195.24.68.17; further IPs 192.36.57.94, 193.46.56.178, 194.71.126.227, 93.99.104.93; last link addition 2070-01-01 05:00:00; User - Link - Wikis: 4; Link - Wikis: 4

Proposed removals

This section is for proposing that a website be unlisted; please add new entries at the bottom of the section. Remember to provide the specific URL that is blacklisted, links to the articles it is used in or useful to, and arguments in favour of unlisting. Completed requests will be marked as done or denied and archived. See also /recurring requests for repeatedly proposed (and refused) removals. The addition or removal of a link is not a vote; please do not bold the first words in statements.

natureperu.com

The link to the photos of coca tea was the only existing page showing filtered products extracted from the coca leaf on the Coca page. A problem caused by some anonymous user's ignorance of Wikipedia's procedures should not harm the publication of information that may interest many people. -- User:Jbricenol (talk) 27 May 2008 20:57 (UTC)

Please see User:COIBot/XWiki/natureperu.com and Talk:Spam_blacklist/Archives/2008/05#Natureperu.com.  – Mike.lifeguard | @en.wb 00:11, 28 May 2008 (UTC)Reply
The content of the pages is not enhanced by adding links to pictures of packs of tea, especially not commercial packs of tea (I quote: "Coca Tea Manufactured by Enaco S.A. and Sold by NaturePeru.com"). You clearly clicked your way through the interwiki links on coca, and added only this link, and I don't know which language 'et.wikipedia' is, but you decided not to translate but just copy and paste the section 'photos' in English.
One of these images could have been nice on the article 'tea pack', 'tea packaging' or something similar (if those articles exist), and maybe on 'coca tea', but the links are not adding anything. Not done. --Dirk Beetstra T C (en: U, T) 09:13, 28 May 2008 (UTC)Reply


Upon investigating this request, I found that there were additional, related domains not blacklisted at the time:


  • This domain has also been spammed




Additional accounts:








--A. B. (talk) 02:32, 1 June 2008 (UTC)Reply


Additional 3 domains Added --A. B. (talk) 01:53, 3 June 2008 (UTC)Reply

outrate.net

My site Outrate.net was blacklisted late in 2006 due to what was perceived as "linkspamming", as I'd added links to a number of Wikipedia pages rapidly without realising this was against policy.

The blacklisting was upheld after a further complaint that the site had excessive adult-oriented advertisements.

The advertisements have been completely removed from the site, which is a content-rich site filled with hundreds of film reviews written specifically for the site, interviews with celebrities conducted by the editor of the site and published exclusively there, a short film festival hosted on the site, etc.

Pages from Outrate.net that belong as external links on Wikipedia include interviews with figures like Billy Hayes, the author, and so on, and these pages are what I'd like to add back in.

Is a removal of the blacklist on Outrate.net at all possible?

Mark

Unlisting is possible, but we generally do not remove links when requested by someone involved in the site. May I suggest that you contact a/some wikiproject(s) on some wikipedia (or other places where the use of your links could be discussed, for en wikipedia I would think about en:Wikipedia:WikiProject Films, I don't know how or what other wikis do), see if they deem the link useful, and then report back here (or ask an established user to request removal). --Dirk Beetstra T C (en: U, T) 13:02, 2 June 2008 (UTC)Reply

Thank you for your reply. I don't know anyone who's a Wikipedia user, or quite how to go about your recommendations above. Is there another option here? The preceding unsigned comment was added by 202.3.37.98 (talk • contribs) 13:47, 2 Jun 2008 (UTC)

There is a sense here in which you explain your own problem. You don't really know how Wikipedias work. If you have valuable knowledge on a subject you would be best getting involved on the pages relating to that subject on a language Wikipedia that you are fairly fluent in. Once you have become involved you will know other contributors & they will know you. It will then be possible for them to decide whether your site provides something that is of sufficient interest to warrant external links.
If you are unable to do that then I regret your site will remain on this list because, as said above, we do not remove sites at the request of those involved with them, sorry --Herby talk thyme 07:09, 3 June 2008 (UTC)Reply

Nothing further heard so closed as  Declined --Herby talk thyme 07:54, 7 June 2008 (UTC)Reply

podiatryworldwide.com

Hi,

My name is Pierre,

I made a neutral, non-profit, scientific, institutional website.

www.podiatryworldwide.com Global Podiatry Worldwide Directory

I made this website to provide what was missing on the internet in the podiatry speciality.

(To develop knowledge, this website simply tries to rationally index, organize and reference all of the major scientific Internet sites relating to podiatry by continent, country and specialty. Its intent is to develop contacts and links, to exchange information, and to organize and standardize knowledge between the professionals of all countries of the world in this era of globalization.)

I tried to begin making external links through Wikipedia by means of about 15 keywords in many languages relating to podiatry, such as podiatry, orthopedic, foot ... but I was blacklisted.

I would like to know what would be possible to do. I would like to know what is acceptable or not. I would like to know what I can do to be indexed on Wikipedia.

Because if I cannot do this on Wikipedia, I ask myself where else I could do it but on Wikipedia.

Thank you

Yes, well our projects are not link directories. You may find DMOZ useful - it is a directory of links. You may find them at http://www.dmoz.org/.  – Mike.lifeguard | @en.wb 00:12, 5 June 2008 (UTC)Reply
I should also point out that the domain is not blacklisted currently.  – Mike.lifeguard | @en.wb 00:14, 5 June 2008 (UTC)Reply

drupalmodules.com

See User:COIBot/XWiki/drupalmodules.com

Removed --Dirk Beetstra T C (en: U, T) 09:35, 3 June 2008 (UTC)Reply

encyclopediadramatica\.(com|net|org)

Articles exist on

The link is on the blacklist because of on-wiki politics at en, not because of spam. Ignoring the problem of using a spam blacklist for wiki-politics, the politics are over. The en ArbCom has said it is up to the community, and the community (on the talk page of the now-existing article) has overwhelmingly said it should be linked to as standard practice. It should be removed from the blacklist and not returned. SchmuckyTheCat

The link was blacklisted by ArbCom in order to protect the project and its usership from harm, abuse and attacks. Attempting to claim it as "wiki-politics" is both dangerous and irresponsible. The "community" of which you speak is an isolated handful of editors who have pursued an interest in this particular topic, and in no way reflects the usership of the 700+ Wikimedia Foundation projects. The fact that it has an article on the English Wikipedia does not give "carte blanche" for its wholesale removal for Foundation-wide indiscriminate linking. It fails almost every criterion for delisting consideration, including but not limited to:
http://www.encyclopediadramatica.com/Main_Page has been whitelisted by myself on en. However, the ArbCom ruling allowing for this home-page linking is limited to within that article only. ArbCom has in no way sanctioned removal from the global blacklist. Perhaps requesting whitelisting of http://www.encyclopediadramatica.com/Main_Page on each of the individual wikis may be appropriate; however, its wholesale removal is not. Each wiki using MediaWiki software has a local whitelist. Only administrators of that wiki can modify their whitelist page. If you want a link added to the whitelist of a particular wiki, you should post a request on that wiki's talk page --Hu12 04:06, 6 June 2008 (UTC)Reply
It should really be removed, because English ArbCom only has authority over the English Wikipedia, not over all of the 700+ wikis. If it is a problem on a single wiki, then they should use the local blacklist. Monobi (talk) 21:25, 6 June 2008 (UTC)Reply
A poor example; however, here's a typical thing regarding en admin en:User:LaraLove:
  • Cut and paste
Use of shock or attack sites on any of the Wikimedia Foundations wikis has always been unacceptable regardless.--Hu12 22:19, 6 June 2008 (UTC)Reply
Actually, after thinking about it some more, this URL should probably be blacklisted, because I can imagine it being spammed on smaller wikis that don't have a local blacklist set up. If wikis want to link to it, they can deal with it locally. Also, wouldn't the regex have to be \.(com|net|org)* ? Meh, I dunno. Monobi (talk) 05:30, 7 June 2008 (UTC)Reply

This one has been extensively debated in the past. For the reasons above and the past discussions this is Declined. Local whitelisting is perfectly possible & easy if the local community are happy with it. Thanks --Herby talk thyme 06:53, 7 June 2008 (UTC)Reply


gemisimo.com

Our site was blacklisted almost a year ago and has been blacklisted ever since. I read extensively the correspondence in the discussions here: [7], [8]. I also read External_links#Links_to_be_considered and I don't think we fall into this category.

Moreover, we improved a lot throughout the year and now our site offers a large variety of professional articles relating to topics within the wiki. We would like to ask you to consider the removal of the site from the blacklist on the basis of our improvements and relevance to the Wikipedia project in relation to diamonds. See for example diamonds.gemisimo.com/en/Diamond-Project/Diamond-Basics/Diamond-Shapes.html as useful info that could be added to the wiki after approval by Wikipedians in the discussion for a specific article. This blacklist is really hurting our efforts and money investments into the site; we have been penalized and accept it, but on the same note we believe we paid the price for one person's foolish mistake that obviously will never happen again. Thanks for the consideration --79.177.9.237 13:18, 8 June 2008 (UTC)Reply

Sorry, the fact that you seem mostly concerned about how "This blacklist is really hurting our efforts and money investments into the site" seems to make the point sufficiently.
Typically, we do not remove domains from the spam blacklist in response to site-owners' requests. Instead, we de-blacklist sites when trusted, high-volume editors request the use of blacklisted links because of their value in support of our projects. If such an editor asks to use your links, I'm sure the request will be carefully considered and your links may well be removed.
Until such time, this request is Declined. – Mike.lifeguard | @en.wb 19:13, 8 June 2008 (UTC)Reply
I respect your opinion and I understand your position. With respect to what you said, I did show that the site is useful for the wiki and could very easily be used in wiki articles related to diamonds; actually, it would be possible to make some new articles using some verifiable information on the site. Please let me know what I can do to help the wiki and contribute, so you can see that I sincerely mean what I say 77.127.136.85 00:39, 9 June 2008 (UTC)Reply
My suggestion would be to try to contact some wikiprojects (on the English Wikipedia, you would have to go to en:Wikipedia:WikiProject; other-language Wikipedias have similar systems, but they are probably called differently). There, try to find an appropriate project and discuss your information. If there is a general consensus that your site is indeed of interest, then one of the regular/established users (one that has no connection with the external site) can come here and request unlisting, and the regex will be removed promptly. I hope this explains it. --Dirk Beetstra T C (en: U, T) 09:15, 9 June 2008 (UTC)Reply

Troubleshooting and problems

This section is for comments related to problems with the blacklist (such as incorrect syntax or entries not being blocked), or problems saving a page because of a blacklisted link. This is not the section to request that an entry be unlisted (see Proposed removals above).

Discussion

Looking ahead

"Not dealing with a crisis that can be foreseen is bad management"

The Spam blacklist is now hitting 120K & rising quite fast. The log page started playing up at about 150K. What are our options looking ahead, I wonder. Obviously it would be good to hear from someone with dev knowledge/connections. Thanks --Herby talk thyme 10:46, 20 April 2008 (UTC)Reply

I believe that the extension is capable of taking a blacklist from any page (that is, the location is configurable, and multiple locations are possible). We could perhaps split the blacklist itself into several smaller lists. I'm not sure there's any similarly easy suggestion for the log though. If we split it up into a log for each of several blacklist pages, we wouldn't have a single, central place to look for that information. I suppose a search tool could be written to find the log entries for a particular entry. – Mike.lifeguard | @en.wb 12:24, 20 April 2008 (UTC)Reply
What exactly are the problems with having a large blacklist? --Erwin(85) 12:34, 20 April 2008 (UTC)Reply
Just the sheer size of it at a certain point: it takes long to load, to search, etc. The above suggestion may make sense - smaller blacklists per month, transcluded into the top level? --Dirk Beetstra T C (en: U, T) 13:16, 20 April 2008 (UTC)Reply
Not a technical person but the log page became very difficult to use at 150K. Equally the page is getting slower to load. As I say - not a techy - but my ideal would probably be "current BL" (6 months say) & before that? --Herby talk thyme 13:37, 20 April 2008 (UTC)Reply
I don't know how smart attempting to transclude them is... The spam blacklist is technically "experimental" (which sounds more scary than it really is) so it may not work properly. I meant we can have several pages, all of which are spam blacklists. You can have as many as you want, and they can technically be any page on the wiki (actually, anywhere on the web that is accessible) provided the page follows the correct format. So we can have one for each year, and just request that it be added to the configuration file every year, which will make the sysadmins ecstatic, I'm sure :P OTOH, if someone gives us the go-ahead for transclusion, then that'd be ok too. – Mike.lifeguard | @en.wb 22:12, 20 April 2008 (UTC)Reply
A much better idea: bugzilla:13805 bugzilla:4459 ! – Mike.lifeguard | @en.wb 01:43, 21 April 2008 (UTC)Reply
Just to note that my browser will no longer render the spam blacklist properly (though it's all there in edit view) - this is a real problem! – Mike.lifeguard | @en.wb 18:58, 14 May 2008 (UTC)Reply
Still there for me but "I told you so" :) By the time I'm back you will have got it all sorted out...... --Herby talk thyme 19:06, 14 May 2008 (UTC)Reply
My suggestion until then is to add another Spam blacklist 2 (needs configuration change). When logging, make sure you say which one you're adding to. We should possibly also split the current one in half so it will be easier to search and load. Configuration would look something like:
	$wgSpamBlacklistFiles = array(
		"DB: metawiki Spam_blacklist", //the current one
		"DB: metawiki Spam_blacklist_2" //the new one
		"DB: metawiki Spam_blacklist_3" //we can even have them configure an extra so that when #2 gets full-ish, we can just start on #3
	);
If we want to do this, it is an easy configuration change - took me <30s on my wiki. Using multiple blacklists is a pretty good solution until we can get a special page to manage the blacklist. The only downside I can see is you'd have to search more than one page to find a blacklist entry (but at least the page will load and render properly!) – Mike.lifeguard | @en.wb 16:54, 15 May 2008 (UTC)Reply

I'm thinking of writing a new extension which works based on a real interface, and allows much better management. Werdna 08:15, 16 May 2008 (UTC)Reply

Until then, I would suggest going with Mike.lifeguard's suggestion. What about splitting off the old part, or the unlogged part, into an 'old' spam blacklist, and working with the normal, current blacklist for the active cases? In that case there is no confusion about where to add to, and things render properly. The old one would only be edited when deleting an entry. --Dirk Beetstra T C (en: U, T) 08:51, 16 May 2008 (UTC)Reply
What is the 'old' part? I think it's a good idea having one active list. In any case there should be some logic in splitting the lists, so choosing what list to add to won't be arbitrary. --Erwin(85) 11:17, 16 May 2008 (UTC)Reply
It is a good idea, but only if it works. Right now it doesn't work for me, so I see this as a problem that needs fixing sooner rather than later. I think Beetstra's method of splitting it up would be fine. – Mike.lifeguard | @en.wb 17:20, 16 May 2008 (UTC)Reply

Requested as bugzilla:14322 because this is just getting silly.  – Mike.lifeguard | @en.wb 22:30, 28 May 2008 (UTC)Reply


LinkWatchers

Database conversions

I am working both on loading the old database (about 4 months' worth of links) and on rebuilding the current database (about 5-6 weeks' worth of links) into a new database.

  • The old database is in an old format, and has to be completely reparsed..
  • The new database had a few 'errors' in it, and I am adding two new fields.

I am running through the old databases by username, starting with aaa.

As a result the new database does not contain too much data yet, and will be 'biased' towards usernames early in the alphabet.

This process may take quite some time, maybe weeks, as I have to throttle the conversion to keep the current linkwatchers 'happy' (they are still running in real-time). These linkwatchers are also putting their data into this new database, so everything after about 18:00, April 29, 2008 (UTC) is correct and complete.

The new database contains the following data; I will work later on making it more accessible for on-wiki research:

  1. timestamp - time when stored
  2. edit_id - service field
  3. lang - lang of wiki
  4. pagename - pagename
  5. namespace - namespace
  6. diff - link to diff
  7. revid - the revid, if known
  8. oldid - the oldid, if any
  9. wikidomain - the wikidomain
  10. user - the username
  11. fullurl - the full url that was added
  12. domain - domain, indexed and stripped of 'www.' -> www.example.com becomes com.example.
  13. indexedlink - rewrite of the fullurl, www.example.com/here becomes com.example./here
  14. resolved - the IP for the domain (new field, and if found)
  15. is it an ip - is the edit performed by an IP (new field)

I'll keep you posted. --Dirk Beetstra T C (en: U, T) 10:31, 30 April 2008 (UTC)Reply
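
For reference, a minimal sketch of what a table holding the fields listed above might look like. This is an illustration only: the actual LinkWatcher schema, column names, types and storage engine are not given here, so everything below (including the use of sqlite3) is an assumption.

# Sketch of a table with the fields described above (names and types are
# illustrative; the real linkwatcher_newlinklog schema may differ).
import sqlite3

conn = sqlite3.connect("linkwatcher_sketch.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS linkwatcher_newlinklog (
    timestamp   TEXT,     -- time when stored (later: UTC time of the edit)
    edit_id     INTEGER,  -- service field
    lang        TEXT,     -- lang of wiki
    pagename    TEXT,
    namespace   TEXT,
    diff        TEXT,     -- link to diff
    revid       INTEGER,  -- the revid, if known
    oldid       INTEGER,  -- the oldid, if any
    wikidomain  TEXT,
    user        TEXT,
    fullurl     TEXT,     -- the full url that was added
    domain      TEXT,     -- indexed: www.example.com -> com.example.
    indexedlink TEXT,     -- www.example.com/here -> com.example./here
    resolved    TEXT,     -- IP for the domain, if found
    is_ip       INTEGER   -- 1 if the edit was performed by an IP
)
""")
conn.commit()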

Well .. keep people posted:

We had two different tables, linkwatcher_linklog and linkwatcher_log. The former is in a reasonably new format, the latter in a very old, outdated format.

The table linkwatcher_linklog is being transferred into linkwatcher_newlinklog, and when a record is converted, it is moved to a backup table (linkwatcher_linklogbackup). That conversion is at about 31% now.

The table linkwatcher_log is being completely reparsed, and when a record is converted, the record is transferred into linkwatcher_logbackup. For this table, too, the conversion is at about 29%.

All converted data goes into linkwatcher_newlinklog, as does the data that is currently being recorded by the linkwatcher 'bots'.

  • linkwatcher_linklog - 1,459,158 records - 946.7 MB
  • linkwatcher_linklogbackup - 646,127 records - 323.5 MB
  • linkwatcher_log - 1,526,600 records - 1.0 GB
  • linkwatcher_logbackup - 628,575 records - 322.8 MB
  • linkwatcher_newlinklog - 2,152,052 records - 1.1 GB

Still quite some time to go. The conversion of linkwatcher_linklog is at usernames starting with 'fir', linkwatcher_log is at usernames starting with 'jbo'. --Dirk Beetstra T C (en: U, T) 20:54, 9 May 2008 (UTC)Reply

Update:

  • linkwatcher_linklog - 761,429 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,036,146 records - 525.7 MB (58% converted)
  • linkwatcher_log - 1,159,216 records - 1.0 GB
  • linkwatcher_logbackup - 995,959 records - 501.8 MB (46% converted)
  • linkwatcher_newlinklog - 3,448,605 records - 1.8 GB

linklog is at "OS2", log is at "Par". --Dirk Beetstra T C (en: U, T) 15:01, 16 May 2008 (UTC)Reply

Update (the first one is starting to convert IPs):

  • linkwatcher_linklog - 303,562 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,494,013 records - 754.8 MB (83% converted)
  • linkwatcher_log - 732,773 records - 1.0 GB
  • linkwatcher_logbackup - 1,422,402 records - 711.0 MB (66% converted)
  • linkwatcher_newlinklog - 5,062,298 - 2.6 GB

linklog is at '199', log is at 'xli' (had to take one down for some time, too much work for the box). --Dirk Beetstra T C (en: U, T) 19:19, 25 May 2008 (UTC)Reply

Whee! One database has been converted (the bot quit just minutes ago):

  • linkwatcher_linklog - 0 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,797,575 records - 912.5 MB
  • linkwatcher_log - 447,802 records - 1.0 GB
  • linkwatcher_logbackup - 1,707,373 records - 836.6 MB
  • linkwatcher_newlinklog - 6,038,610 - 3.1 GB

The other bot is now at 79%, at '59.' (somewhere in the IP-usernames). Getting there! --Dirk Beetstra T C (en: U, T) 11:49, 30 May 2008 (UTC)Reply

The bots have finished this job. The table 'linkwatcher_newlinklog' contains at the time of this post 7,012,340 records. The database is now 'complete' from approx. 2007-09-01 on (except for some bot downtime, which may be up to several days in total). Bots are now working on getting the time into UTC (see below) and I am working on a bot that fills up the gaps, and that can also parse the periods before the linkwatchers started, or can parse wikis that we excluded from the linkwatchers. --Dirk Beetstra T C (en: U, T) 10:51, 7 June 2008 (UTC)Reply

Timestamp

Timestamp is from now on stored as the UTC time of the edit. I will start updating the old records later. --Dirk Beetstra T C (en: U, T) 19:22, 3 June 2008 (UTC)Reply

Thanks for that. Did you try linking to the domain yet? --Erwin(85) 20:01, 3 June 2008 (UTC)Reply
No; on en I sometimes see people blanking the report because they disagree. COIBot would then not be able to save an update, since the blacklisted link is in there. It should work normally, but with that problem it would go wrong.
For those records where the time has been converted to UTC, the bot adds ' (UTC)' to the timestamp. Take all other timestamps with a grain of salt; there are quite a few which are wrong. A bot is working on updating them, but due to insanity of the programmer of the bots, I have had to restart that program a couple of times (I suspect about 200,000 records are 'wrong' out of the 6.5 million in the database). --Dirk Beetstra T C (en: U, T) 10:31, 5 June 2008 (UTC)Reply

More statistics ... ????

I now have a fairly complete database with a lot of links (we are hitting 7 million records in a couple of hours ..), and I could get a lot of statistics from that. The current table has the following fields:

  1. timestamp - UTC Time (subject to updating)
  2. edit_id - service field
  3. lang - lang of wiki
  4. pagename - pagename
  5. namespace - namespace
  6. diff - link to diff
  7. revid - the revid, if known
  8. oldid - the oldid, if any
  9. wikidomain - the wikidomain
  10. user - the username
  11. fullurl - the full url that was added
  12. domain - domain, indexed and stripped of 'www.' and 'www#.' -> 'www.example.com' becomes 'com.example.'
  13. indexedlink - rewrite of the fullurl, www.example.com/here becomes com.example./here
  14. resolved - the IP for the domain (new field, and if found)
  15. is it an ip - is the edit performed by an IP (new field)

NOTES:

  • 'resolved' is the IP of the server the website is hosted on. Simply: A webserver is a computer, and that computer, when connected to the internet, has an IP. When you request a webpage, a 'nameserver' converts the name of the domain to the IP of the computer the site is hosted on, and then requests the 'computer with that IP' to send you the webpage referred to as 'domain'. In other words, if you have a computer, and run a webserver, you can register a large number of domains and host them on your computer. If you then are a spammer, you can use a large number of domains (which might prevent detection), but all these websites will still have the same 'resolved' IP!
  • we are converting the normal domain to the indexed one to make it searchable in a quicker way (MySQL reasons).
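
To illustrate the two notes above, a rough sketch of both operations: rewriting a domain into its indexed form, and resolving it to the IP of the server it is hosted on so that different domains on the same box can be grouped. This is not COIBot's actual code, just a simplified illustration.

# Illustration only: index a domain the way described above, and resolve it
# to the hosting server's IP so co-hosted spam domains can be grouped.
import re
import socket

def index_domain(hostname):
    # www.example.com (or www2.example.com) -> com.example.
    labels = hostname.lower().rstrip(".").split(".")
    if labels and re.fullmatch(r"www\d*", labels[0]):
        labels = labels[1:]
    return ".".join(reversed(labels)) + "."

def resolve(hostname):
    # Return the server IP for a domain, or None if it does not resolve.
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

print(index_domain("www.example.com"))  # com.example.
print(resolve("example.com"))           # whatever IP example.com currently resolves to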

At the moment the bots calculate:

  1. how many links did this user add
  2. how often did this domain get added
  3. how often did this user add this domain
  4. to how many wikis did this user add this domain

But I also have the possibility to calculate:

  1. how many users that added this link were not using a user account.
  2. on how many computers are the websites that this user adds hosted.
  3. how often did domains that are hosted on 'this' computer (for a certain domain) get added.
  4. how often did this user add domains hosted on 'this' computer (for a certain domain).
  5. to how many wikis did this user add domains hosted on 'this' computer.

etc. etc.

The biggest problem is .. how to organise that information, and how to make that available to you here. But if there are statistics that would greatly improve (y)our efforts here, please let me know. --Dirk Beetstra T C (en: U, T) 10:38, 6 June 2008 (UTC)Reply
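
As a concrete illustration of the kind of statistics listed above, here are a couple of aggregate queries against a table like the sketch in the database-conversion section. The schema, table name and example domain are assumptions; these are not the bots' actual queries.

# Illustrative aggregate queries (assumes the sketch table from the
# "Database conversions" section exists; COIBot's real queries may differ).
import sqlite3

conn = sqlite3.connect("linkwatcher_sketch.db")

# How many distinct users without an account (IPs) added a given domain?
anon_users = conn.execute(
    "SELECT COUNT(DISTINCT user) FROM linkwatcher_newlinklog "
    "WHERE domain = ? AND is_ip = 1",
    ("com.example.",),
).fetchone()[0]

# How often were domains hosted on the same server added, and to how many
# wikis, grouping co-hosted domains by their resolved IP.
per_server = conn.execute(
    "SELECT resolved, COUNT(*) AS additions, COUNT(DISTINCT lang) AS wikis "
    "FROM linkwatcher_newlinklog WHERE resolved IS NOT NULL "
    "GROUP BY resolved ORDER BY additions DESC"
).fetchall()

print(anon_users, per_server[:5])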

Excluding our work from search engines

This is a bigger problem for enwiki than for us, but still... I'd like to ask that subpages of this page be excluded from indexing via robots.txt so we do not receive complaints about "You're publicly accusing us of spamming!" and the like. These normally end up in OTRS, where it is a waste of volunteers' time and energy. The MediaWiki search function is now good enough that we can use it to search this site for a domain rather than relying on a google search of meta (ie in {{LinkSummary}} etc). As well, we'll include the subpages for COIBot and LinkReportBot reports. – Mike.lifeguard | @en.wb 20:05, 10 May 2008 (UTC)Reply
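
For illustration, one way to check how such an exclusion would behave once a robots.txt rule is in place. The Disallow path below is hypothetical; it is not the rule that was (or will be) actually deployed.

# Hypothetical robots.txt rule and a check of which pages it would block
# (illustration only; the real rule and paths may differ).
import urllib.robotparser

rules = """
User-agent: *
Disallow: /wiki/Talk:Spam_blacklist
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

archive = "https://meta.wikimedia.org/wiki/Talk:Spam_blacklist/Archives/2008/05"
blacklist = "https://meta.wikimedia.org/wiki/Spam_blacklist"
print(rp.can_fetch("*", archive))    # False - this talk page and its subpages excluded
print(rp.can_fetch("*", blacklist))  # True - the blacklist itself stays indexable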

I have made a bug for this: bugzilla:14076. – Mike.lifeguard | @en.wb 20:10, 10 May 2008 (UTC)Reply
Good idea. There's no need for these pages to be indexed. --Erwin(85) 08:06, 12 May 2008 (UTC)Reply
Mike, I strongly agree with excluding crawlers from our bot pages.
I very much disagree, however, with excluding crawlers from this page and its archives. In many cases, seeing their name in a Google search is the first time at least half our hard-core spammers finally take us seriously. Since they usually have other domains we're unaware of, this deters further spam.
If domain-owners feel wronged about entries on this page, overworked OTRS volunteers should feel free to direct them to the removals section here and we can investigate. In my experience, many Wikipedia admins and editors don't have enough experience with spam to know how to investigate removal requests and separate the sheep from the goats.
If there's been a false report, we can move the entries from our crawlable talk archives to a non-crawlable subpage (call it "false positives").
I'll also note that I think we've been getting more false positives blacklisted since we got these bot reports. I continue to feel strongly that we must be very conservative in blacklisting based on bot reports. If a site's been spammed, but it looks useful, wait until some project complains or blacklists it. Even if a site's been spammed and doesn't look useful, if the spammer hasn't gotten enough warnings then we shouldn't blacklist it unless we know he fully understands our rules -- that or we get a complaint from one of our projects. Perhaps we should have our bots issue multilingual warnings in these cases.
As for the spammer that's truly spammed us in spite of 4 or more warnings, I don't care if he likes being reported as a spammer or not. I've spent almost two years dealing with this subset of spammers and they're going to be unhappy with us no matter what we do until they can get their links in. --A. B. (talk) 13:39, 12 May 2008 (UTC)Reply
I agree with you that the bot reports probably have a threshold that is too low - perhaps that can be changed. Until such time, they need to be handled carefully.
The problem I am talking about is not false positives. Those are very straightforwardly dealt with. The real problem is not with people emailing us to ask to get domains de-listed, but rather with people emailing us demanding that we stop "libeling" them. What they want is for us to remove all references to their domain so their domain doesn't appear in search results next to the word "spam". Well, I'm not prepared to start blanking parts of the archives to make them stop whining - we need these reports for site maintenance. Instead we can have our cake and eat it too: don't allow the pages to be indexed, and keep the reports as-is. I'm not sure I see how having these pages indexed deters spammers. What I do see is lots of wasted time dealing with frivolous requests, and a way to fix that.
Just a reminder to folks that we should discuss this here, not in bugzilla. I am closing the bug as "later" which I thought I had done earlier. Bugzilla is for technical implementation (which is straightforward); this space is for discussion. They will yell at us if we spam them by discussing on bugzilla :) – Mike.lifeguard | @en.wb 16:16, 12 May 2008 (UTC)Reply
If a site-owner has truly spammed us in spite of repeated warnings, then we are not libeling them if search engines pick up their listings here. I've dealt with such complaints before; I point out the clear evidence that's a matter of public record: warnings, diffs and our rules. I tell them if they find any factual inaccuracies in that record to let us know and we'll fix it immediately. I'm happy to discuss these blacklisting decisions with site-owners that bring them to the removals section.
Wikimedia's organization and servers are based in the United States; libel cases there are very difficult to pursue. A fundamental tenet there is that truth is an absolute defense. If our records are true and spammers have been previously warned and apprised of our rules, then they don't have a leg to stand on in that jurisdiction.
Servers may be based in the US, but they are accessible throughout the world. Thus you're liable in other courts, like say the UK (unless wikia's servers are in NY[9], whose law is questionable at best). In the UK just getting your case defended will cost you $200,000-plus up front, and much more if you lose - and if you are not represented, judgment will be entered against wikia by default. [10]
~ender 2008-05-18 11:04:AM MST
As for deterrence, periodically perusing hard core spammer forums like seoblackhat.com and syndk8.net as well as more general SEO forums like forums.digitalpoint.com will show lively discussions as to whether spamming us is worth the risks and aggravation. The more negative the chatter there about "wiki link-nazis", the better off our projects are.
My sense is that the volume of complaints has grown recently as we've blacklisted more domains based on bot reports. Some of these site-owners may not be truly innocent but they're not hard-core and haven't been warned sufficiently. Blacklisting comes as a large, alarming shock to them.
--A. B. (talk) 17:03, 12 May 2008 (UTC)Reply
Of course it's not real libel. But that doesn't stop them from complaining, which is a waste of time. And a needless waste of time when it can be so easily stopped. On the other hand, I do see your point about being perceived as the link Nazis. Perhaps someone other than the three of us can share some thoughts?
Also, you're assuming there's a presumption of innocence. In other legal jurisdictions (and for a number of things in the US) that is not the case. You will have to prove that it is not libel.
~ender 2008-05-18 11:09:AM MST
I agree with excluding these pages from search engines. While Meta is nowhere near as high on searches as enwiki, it won't take much. Cary Bass demandez 23:18, 13 May 2008 (UTC)Reply

I also agree, although I do believe that the fact that these reports rank so high has a preventive effect: finding these reports about other companies should stop companies from doing the same. But given the negative impact these reports may have on a company (although that is also not our responsibility; when editing Wikipedia they get warned often enough that we are not a vehicle for advertising!), I think it is better to hide them, especially since our bots are not perfect and sometimes pick up links wrongly, and we are only human and may make mistakes in reporting here as well. --Dirk Beetstra T C (en: U, T) 09:45, 14 May 2008 (UTC)Reply

Dirk, I think the way to "have our cake and eat it" is to have this talk page and its archives crawlable and just exclude the bot pages. As for mistakes here, I don't see many true mistakes in human-submitted requests that actually get blacklisted. I at least skim almost all the blacklist removal and whitelist requests both here and on en.wikipedia. The most common mistake humans make is to unknowingly exclude all of a large hosting service (such as narod.ru) instead of just the offending subdomain; I have yet to see a large hosting service complain. Otherwise, >>90% of our human-submitted requests nowadays have been pretty well thrashed out on other projects first before they even get here. Of those that are still flawed, they either get fixed or rejected with a public discussion. It's not that humans are so much smarter than bots, but rather, like so many other things on wiki, after multiple edits and human editors, the final human-produced product is very reliable.
I spend several hours a month reading posts on several closed "black hat" SEO forums I'm a "member" of. The reliability, finality and public credibility of our spam blacklisting process bothers a lot of black hats. I think our goal should be to keep this going. --A. B. (talk) 12:30, 14 May 2008 (UTC)Reply
The point isn't whether we're right or wrong. The point is whether they complain and waste our time or not. It seems to me that looking like link Nazis publicly is not a very strong rationale if it conflicts with the goal of doing the work without wasted time. That said, views on this may differ. – Mike.lifeguard | @en.wb 16:01, 14 May 2008 (UTC)Reply
Would it be too radical if we asked for a way to choose which pages shouldn't be indexed, on-wiki? An extension could easily be created and installed for Meta which lets us add a <noindex /> tag to the top of the article and causes an appropriate "noindex" meta tag to be added to the HTML output, thus preventing that page (and only that page) from being indexed. How do you feel about my raw idea? Is it too prone to being abused? Huji 13:38, 15 May 2008 (UTC)Reply
Sounds good, but if enabled on en that would give strange vandalism, I am afraid. I would be more inclined towards an extension which listens to a MediaWiki page on which pages that should not be indexed can be listed (that should include their subpages as well). I don't know how difficult that would be to create. --Dirk Beetstra T C (en: U, T) 16:25, 15 May 2008 (UTC)Reply



Useful investigation tool: url-info.appspot.com

In investigating User:COIBot/XWiki/url-info.appspot.com, I found that the linked site provides a small browser add-on that's a very useful tool for investigating all the links embedded in a page, whether it's a Wikimedia site page or an external site. I recommend others active with spam mitigation add it to their browser toolbar and check it out. This could have saved me many hours in the last year:

If you're trying to find possibly related domains to investigate among the links on a spam site, this will quickly list them all, sparing the aggravation of clicking on every link. In fact it's so easy to glean information that we'll need to ensure we're not mindlessly reporting unrelated domains as "related" when they've appeared on a spam site page for some innocent reason:

As for the spam report for this domain, I don't think the extent of COI linking (just 1 link to each of 4 projects) currently meets the threshold for meta action; local projects can deal with this as they see fit. The tool is free and the page has no ads.

Note that appspot.com, the underlying main domain, is registered to Google for users of its App Engine development environment. --A. B. (talk) 14:22, 21 May 2008 (UTC)Reply

It could be useful. It simply lists external links though, so like you said I guess most of 'm have nothing to do with the site. Note that the add-on is a w:en:bookmarklet, so not an extension or something. --Erwin(85) 16:06, 21 May 2008 (UTC)Reply


Thresholding the xwiki

The linkwatchers now calculate, for each link addition, the following 4 values (all based on what is currently in the database):

  1. UserCount - how many external links did this user add
  2. LinkCount - how often is this external link added
  3. UserLinkCount - how often did this user add this link
  4. UserLinkLangCount - to how many wikipedia did this user add this link.

The threshold was first:

if ((($userlinklangcount / $linkcount) > 0.90)  && ($linkcount > 2) && ($userlinklangcount > 2)) {
   report
}

I noticed that when one user performs two edits on the first link addition in one wiki, and then starts adding to other wikis as well, the user gets reported at edit 11, which I found way too late:

  • 3/3
  • 10/11
  • 11/12

Earlier/in-between combinations do not pass that threshold ..

The code is now:

if ((($userlinkcount / $linkcount) > 0.66)  && (($userlinklangcount / $linkcount) > 0.66 ) && ($userlinklangcount > 2)) {
  report
}

This is (userlink/link & wikis/link):

  • 3/3 & 3/3
  • 3/4 & 3/4
  • 4/5 & 4/5
  • 5/6 & 5/6
  • 5/7 & 5/7
  • 6/8 & 6/8

etc.

I am also thinking of doing something like ($userlinkcount < xxx), which should take out some more established editors, xxx being .. 100 (we had a case of one editor adding 20 links in one edit .. you need to be hardcore to escape 100, adding 34 links every edit)?

I want to say here that the threshold is low, and maybe it should be. Cleaning 10 wikis when crap is added is quite some work; it is easier to close/ignore a report where only 4 wikis were affected. I will let this run and see what happens; this may give a lot more work, in which case I am happy to put the threshold higher.

Comments? --Dirk Beetstra T C (en: U, T) 15:15, 22 May 2008 (UTC)Reply
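
To make the discussion above easier to follow, the current rule plus the extra cut-off under consideration, written out as a small function. The names mirror the counters described at the top of this section; this is an illustration, not COIBot's actual code.

# Sketch of the current reporting threshold plus the ($userlinkcount < xxx)
# cut-off being considered above (illustration only, not the bot's code).
def should_report(linkcount, userlinkcount, userlinklangcount, cutoff=100):
    if userlinkcount >= cutoff:                        # proposed extra condition (xxx = 100)
        return False
    return (userlinkcount / linkcount > 0.66           # >2/3 of additions by this user
            and userlinklangcount / linkcount > 0.66   # >2/3 of additions cross-wiki
            and userlinklangcount > 2)                 # on more than 2 wikis

# 3 additions of a link, all by one user, on 3 different wikis -> reported
print(should_report(linkcount=3, userlinkcount=3, userlinklangcount=3))  # True
# 4 additions, 3 by this user on 3 wikis -> caught by the new 66% rule,
# but not by the old 90% rule
print(should_report(linkcount=4, userlinkcount=3, userlinklangcount=3))  # True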

I feel like this is too low. Most reports are simply closed with "reverted" as they're not enough to warrant blacklisting. As such, I'm not sure how useful that is. Not sure how the math should change, but there must be a happy medium.  – Mike.lifeguard | @en.wb 14:23, 3 June 2008 (UTC)Reply
If they need reverting then I think it is OK. If I make it (e.g.) 5, then the 3 questionable edits would not get reverted. --Dirk Beetstra T C (en: U, T) 14:28, 3 June 2008 (UTC)Reply

Of possible interest to people here

I've detected a rise in new approaches to promotional activity involving Commons. I've posted there & others may wish to read/look. Thanks --Herby talk thyme 11:10, 31 May 2008 (UTC)Reply

COIBot

poking COIBot

I notice that sometimes people who are not active on IRC need some link reports. Admins here can now add {{LinkSummary|domain}} to User:COIBot/Poke; when COIBot picks up the edit to that page (and it should), it will put the domains into its reporting queue (high priority, that is, only behind waiting XWiki reports) and create a report on the link(s). The first report should be saved within about 5 minutes; if it takes longer than 15 minutes there is probably something wrong, and it may be useful to add the template with the link again (it reads the added part of the diffs, i.e. the right column), or to poke me or another person who is active on IRC personally. Hope this is of help. --Dirk Beetstra T C (en: U, T) 12:46, 4 June 2008 (UTC)Reply

P.S. Please don't overuse the functionality; everything still needs to be saved. --Dirk Beetstra T C (en: U, T) 12:54, 4 June 2008 (UTC)Reply
It had some startup problems, but all seems to work fine now. --Dirk Beetstra T C (en: U, T) 17:28, 4 June 2008 (UTC)Reply

Sorting - UTC

The COIBot reports now are/should be sorted by time, newest records at the bottom. The newer records are now stored in UTC, and there is a bot busy with converting the time of the old records to UTC. When the time is in UTC, it will show ' (UTC)' behind the timestamp. --Dirk Beetstra T C (en: U, T) 13:56, 5 June 2008 (UTC)Reply

Random (junk?) thought

It went here! (it's a wiki - if anyone disagrees...) --Herby talk thyme 09:12, 9 June 2008 (UTC)Reply