Talk:Spam blacklist

From Meta, a Wikimedia project coordination wiki
This is an archived version of this page, as edited by Herbythyme (talk | contribs) at 08:10, 27 May 2008 (→‎Turkish real estate crossi wiki spam: yep). It may differ significantly from the current version.

Shortcut:
WM:SPAM
The associated page is used by the MediaWiki Spam Blacklist extension, and lists strings of text that may not be used in URLs on any page in Wikimedia Foundation projects (as well as many external wikis). Any Meta administrator can edit the spam blacklist. There is also a more aggressive way to block spamming through direct use of $wgSpamRegex. Only developers can make changes to $wgSpamRegex, and its use is to be avoided whenever possible.

For more information on what the spam blacklist is for, and the processes used here, please see Spam blacklist/About.

Please post comments to the appropriate section below: Proposed additions, Proposed removals, or Troubleshooting and problems; read the message boxes at the top of each section for an explanation. Also, please check back some time after submitting, as there could be questions regarding your request. Per-project whitelists are discussed at MediaWiki talk:Spam-whitelist. In addition, please sign your posts with ~~~~ after your comment. For other discussions related to the blacklist that do not concern a particular link, please see Spam blacklist policy discussion.

Completed requests are archived (list, search); additions and removals are logged.

snippet for logging: {{/request|1012405#{{subst:anchorencode:section name here}}}}
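
For illustration (using a section name from this page and keeping the number from the snippet above), {{/request|1012405#{{subst:anchorencode:Turkish real estate cross-wiki spam}}}} would, once substituted, expand to {{/request|1012405#Turkish_real_estate_cross-wiki_spam}}, since anchorencode turns spaces into underscores.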

If you cannot find your remark below, please do a search for the URL in question with this Archive Search tool.

Spam that only affects a single project should go to that project's local blacklist.

Proposed additions

This section is for proposing that a website be blacklisted; add new entries at the bottom of the section, using the basic URL so that there is no link (example.com, not http://www.example.com). Provide links demonstrating widespread spamming by multiple users. Completed requests will be marked as done or denied and archived.


referenceforbusiness.com

Please consider blacklisting referenceforbusiness.com. It's the sister site of the previously blocked stateuniversity.com (block discussion here), owned by Advameg (Advameg discussion here). A search on referenceforbusiness.com yielded its insertion into over 20 articles on the English-language Wikipedia, including American Pop Corn Company, Frederick W. Smith, Dwight Schar, and List of acquisitions by Symantec. I have not found a specific spam user; I think instead the site is typically added as a reference by editors who are adding cites to articles and come across referenceforbusiness.com as a Google result for obscure topics. --Zippy 05:55, 29 April 2008 (UTC)Reply

If this is a solely en wp issue then the blacklisting should be local (here) thanks --Herby talk thyme 16:35, 29 April 2008 (UTC)Reply
Thank you for the suggestion - I was not aware of the en.wikipedia list. I did a check, and links to referenceforbusiness.com also appear on de.wikipedia, nl.wikipedia, fr.wikipedia, it.wikipedia, ja.wikipedia, and zh.wikipedia, albeit to a reduced degree compared to en.wikipedia. (links go to search results) --Zippy 22:30, 29 April 2008 (UTC)Reply


175 links on top 10 wikis (here). Certainly needs looking at - out of time myself now --Herby talk thyme 15:50, 30 April 2008 (UTC)Reply

This type is really difficult to deal with. There is certainly some valid use of the link, and some excessive use of it.
If we list it, some valid editors will be affected; if we don't, it is likely the linkage will grow & grow. Other views very welcome, thanks --Herby talk thyme 08:02, 2 May 2008 (UTC)Reply
I can see RfB as being a valid use as in "a user in good faith adds it as a citation for an article," but I don't think referenceforbusiness.com is ever valid as a reference, due to verifiability and reliability concerns. I'm having a hard time coming up with a case where I'd not remove it as a link. --Zippy 11:36, 2 May 2008 (UTC)Reply
I'm moving that way I think. I can see why it might be an idea to use it but equally I would have thought any really important stuff should be found elsewhere? Those interested may want to look here as well. --Herby talk thyme 11:51, 2 May 2008 (UTC)Reply
All over, there are discussions on talk pages about the suitability of this domain as a reference. I don't think it is ok for that purpose. But I'm not sure that necessitates blacklisting globally. – Mike.lifeguard | @en.wb 12:32, 2 May 2008 (UTC)Reply
After looking at some more uses, and looking into the background discussion some more, I'm thinking this may be similar to the case of ezinearticles.com - there might be some good content in there, but it is not a reliable source on the whole. I'd want some more input before proceeding. – Mike.lifeguard | @en.wb 01:48, 14 May 2008 (UTC)Reply


anti-jw.chat.ru



Pages replaced with that site by



on af, ast,

The spam link had been temporarily added to the bl to stop him (he obviously just started at "a").

\banti-jw\.chat\.ru\b

Probably it should stay on the bl, best regards, --birdy geimfyglið (:> )=| 21:43, 19 May 2008 (UTC)Reply

(note, I have not yet added it to the log, --birdy geimfyglið (:> )=| 22:22, 19 May 2008 (UTC))Reply


Given that it happened not only on (af) and (ast) but also on (en) and (es), and the fact that he obviously (as seen via the big brother channel) just started at "a" to replace all articles with that link, I think the link should stay blacklisted; I will add it to the log now. Thanks, --birdy geimfyglið (:> )=| 17:05, 20 May 2008 (UTC)Reply

Sounds good to me. Thanks birdy!  – Mike.lifeguard | @en.wb 17:09, 20 May 2008 (UTC)Reply

Natureperu.com



Placed on articles about Coca by:



.Koen 15:54, 20 May 2008 (UTC)Reply

Caught by the linkwatchers as well, covered in User:COIBot/XWiki/natureperu.com. Suggest discussing there, more info there. Good catch, .Koen! --Dirk Beetstra T C (en: U, T) 15:58, 20 May 2008 (UTC)Reply
Added. --Dirk Beetstra T C (en: U, T) 16:06, 20 May 2008 (UTC)Reply

spirited-expeditions.com



Repeated cross-wiki spamming in articles about Lago de Atitlán.

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] --87.178.150.157 22:34, 20 May 2008 (UTC)Reply

Thanks. Added  – Mike.lifeguard | @en.wb 22:40, 20 May 2008 (UTC)Reply

searchx.1gb.in

This was inserted in only a few places as far as I know, but they were pages like chr:Wikipedia:How to edit a page and simple:Wikipedia:How to copy-edit. The website is of no value to any Wikimedia wiki.

--Jorunn 08:21, 22 May 2008 (UTC)Reply

Added \bsearchx\.1gb\.in\b  – Mike.lifeguard | @en.wb 16:16, 22 May 2008 (UTC)Reply


geocities.com/satanismresource

Domains

Accounts

Reference

--A. B. (talk) 03:42, 25 May 2008 (UTC)Reply

Added --A. B. (talk) 03:48, 25 May 2008 (UTC)Reply


banknotes.com spam

Spamming wikipedia since July 2005.


Domains

Accounts

References

--A. B. (talk) 04:26, 25 May 2008 (UTC)Reply

Added --A. B. (talk) 17:34, 25 May 2008 (UTC)Reply

christianthomaskohl pseudoscience vanispamcruftisement

A couple of cross-wiki spammers; there's a whole bunch of one-wiki spammers at the WT:WPSPAM link.

Full report at w:WT:WPSPAM#christianthomaskohl (permanent link), see also User:COIBot/LinkReports/ctkohl.googlepages.com and User:COIBot/LinkReports/christianthomaskohl.googlepages.com. Pseudoscience, blatant self-promotion. 143.238.211.63 11:34, 26 May 2008 (UTC)Reply

Agree with the proposed addition. Adambro 12:30, 26 May 2008 (UTC)Reply

Turkish real estate cross-wiki spam

Works via dynamic Turkish IP addresses. For each wiki version, the corresponding language version of the site is added; e.g. on nl-wiki it is http: // www.antalyahomes.com/default.asp?lang=nl. Kind regards, MoiraMoira 07:53, 27 May 2008 (UTC)Reply

Agreed, cross wiki & active. Cleaned & Added thanks --Herby talk thyme 08:10, 27 May 2008 (UTC)Reply

Proposed additions (Bot reported)

This section is for websites which have been added to multiple wikis as observed by a bot.

Items there will automatically be archived by the bot when they get stale.

Sysops, please change the LinkStatus template to closed ({{LinkStatus|closed}}) when the report is dealt with, and change to ignore for good links ({{LinkStatus|ignore}}). More information can be found at User:SpamReportBot/cw/about

These are automated reports, so please check the records and the links thoroughly; they may be good links! For some more info, see Spam blacklist/help#SpamReportBot_reports

If the report contains links to fewer than 5 wikis, then only add it when it is really spam. Otherwise just revert the link additions and close the report; closed reports will be reopened if spamming continues.

Please place suggestions on the automated reports in the discussion section.

NOTE: Please be aware that the linkwatcher databases are being regenerated. They are roughly correct since April 29, 2008, around 18:00 UTC (except for some minor bot downtime). The database before that is biased towards usernames at the beginning of the alphabet; IP users have not been converted yet.

SpamReportBot is currently offline

For those of us who feel like pruning some old data: Category:Open XWiki reports. That category contains the reports with {{LinkStatus|Open}}. The open reports from User:COIBot are listed below.

Running; it will report a domain shortly after a link is used more than 2 times by one user on more than 2 wikipedias (technically: when more than 66% of the additions of this link were made by this user, and more than 66% were added cross-wiki). Same system as SpamReportBot (please close reports when reverted/blacklisted/waiting for more, or set to ignore when it is a good link)

List: vrsystems.ru (last update 2023-06-27 15:51:16, by COIBot)
IPs: 195.24.68.17, 192.36.57.94, 193.46.56.178, 194.71.126.227, 93.99.104.93
Last link addition: 2070-01-01 05:00:00; User - Link - Wikis: 4; Link - Wikis: 4

Proposed removals

This section is for proposing that a website be unlisted; please add new entries at the bottom of the section. Remember to provide the specific URL blacklisted, links to the articles they are used in or useful to, and arguments in favour of unlisting. Completed requests will be marked as done or denied and archived. See also /recurring requests for repeatedly proposed (and refused) removals. The addition or removal of a link is not a vote; please do not bold the first words in statements.


www.aluminiumleader.com

This site was blacklisted for being added to such pages as aluminium, bauxite, ruby etc. However, it should be pointed out that aluminiumleader.com is a web resource dedicated especially to aluminium - the history of the metal, its invention, ways of production and utilization. Aluminium is produced from alumina and before that from bauxite. Aluminium oxides include rubies, sapphires and other precious and semi-precious stones mentioned in the MINERALS part of the site. That is why the links to this web site were added to the respective articles. It is an informative, encyclopaedic, interactive resource in Russian and English without any ads or promotions. Therefore, the site should be removed from the spamlist. LOscritor 06:02, 20 May 2008 (UTC)Reply

Note for other editors: The site was added to the blacklist by Beetstra on 14 April 2008. It appears to have been in response to the report User:COIBot/LinkReports/aluminiumleader.com. Adambro 06:13, 20 May 2008 (UTC)Reply
The site was added (mainly the main domain, and sometimes readded) to quite a number of wikipedias (10-15?) by an IP (213.248.20.174) to external link sections (diff, diff ..). I concur, this site may be of interest on the Russian wikipedia (the main language for the site), and maybe on en for certain parts of the data (though I am sure that there are non-commercial sites that have the same information - "Project of UC RUSAL, the world’s largest producer of aluminium and alumina."). For all the other languages its use is even more questionable. I would suggest declining to remove this site and whitelisting the domain on the Russian wikipedia; whitelisting can be discussed on en as well (though that may only be for one or two pages on the whole server, flash content is discouraged on en), and for other wikis specific urls may be whitelisted, but the domain is unnecessary (we are writing an encyclopedia, not a linkfarm or an internet directory). --Dirk Beetstra T C (en: U, T) 10:50, 20 May 2008 (UTC)Reply
I have to agree here. Whitelist it where needed, but there's no shortage of other (better) sites for that info, and this one was spammed.  Declined  – Mike.lifeguard | @en.wb 15:28, 20 May 2008 (UTC)Reply
  • I have to disagree with a number of arguments against whitelisting:

1) "it is mainly for Russian Wiki" - The versions in Russian and in English are equal containing the same information. The discussed web site covers the activities of all major aluminium producers: American Rio Tinto Alcan and Alcoa, Australian BHP Billiton, Chinese Chalco and Chinalco whose working language is English. That is why news provided by Dow Jones subscription are of especial importance to aluminium producers and journalists who specialize on metals and mining. Therefore the link should be returned to the English Wiki at least 2) "Project of UC RUSAL, the world’s largest producer of aluminium and alumina" - all scientific and educational projects need sponsors. You do not delete the web site of the Bolshoi Theatre (www.bolshoi.ru) from the respective article only because it has a large banner of its sponsor on top of its home page. 3)"but there's no shortage of other (better) sites for that info" - could you, please, indicate any other web resource containing as comprehensive information as the site under discussion in terms of history of the metal, ways of its production and use, as well as latest news of the aluminium industry and market indices?LOscritor 11:47, 21 May 2008 (UTC) The preceding unsigned comment was added by 213.248.20.174 (talk • contribs) 11:49, 21 May 2008 (UTC)Reply



In particular Luxo is worth a look. Spamming is not about content - it is about excessive link placement (in this case across wikis) with the intention of promoting a website & with little else in the way of positive contributions. This discussion & request is now closed thanks --Herby talk thyme 12:00, 21 May 2008 (UTC)Reply

http://www.iupac.org/reports/periodic_table/index.html, http://www.webelements.com/webelements/elements/text/Al/index.html, http://www.infomine.com/commodities/aluminum.asp, http://www.britannica.com/eb/art-64454, http://www.indexmundi.com/en/commodities/minerals/aluminum/aluminum_table12.html, plus all peer-reviewed articles about the subject as published by e.g. Elsevier, Wiley (and all other publishers), and chemical societies around the world such as the ACS and RSC. Sources enough. The only difference: this link was pushed to several wikis, and indeed may only be of interest on the Russian, and maybe the English, wikipedia. I do see this site has good (correct) information, but that is also available from other sites. Maybe not in this comprehensive form, but that is not a reason to add a link to it; the reason to add a link is that you want your statements to be attributed, and that can also be linked to somewhere else (it does not even need a working external link..). Therefore, again, Not done and referred to the appropriate whitelists (if necessary). --Dirk Beetstra T C (en: U, T) 12:58, 21 May 2008 (UTC)Reply
ru.wikipedia discussions:
Spam accounts:
Typically, we do not remove domains from the spam blacklist in response to site-owners' requests. Instead, we de-blacklist sites when trusted, high-volume editors request the use of blacklisted links because of their encyclopaedic value in support of our encyclopaedia pages. If such an editor asks to use your links, I'm sure the request will be carefully considered and your links may well be removed.
This blacklist is used by more than just our 700+ Wikimedia Foundation wikis (Wikipedias, Wiktionaries, etc.). All 3000+ Wikia wikis plus a substantial percentage of the 25,000+ unrelated wikis that run on our MediaWiki software have chosen to incorporate this blacklist in their own spam filtering. Each wiki has a local "whitelist" which overrides the global blacklist for that project only. Some of these non-Wikimedia sites may be interested in your links; by all means feel free to request local whitelisting on those.
Unlike Wikipedia, DMOZ is a web directory specifically designed to categorize and list all Internet sites; if you've not already gotten your sites listed there, I encourage you to do so -- it's a more appropriate venue for your links than our wikis. Their web address: http://www.dmoz.org.
--A. B. (talk) 20:17, 23 May 2008 (UTC)Reply
More complaints about this domain:
--A. B. (talk) 20:21, 23 May 2008 (UTC)Reply

Troubleshooting and problems

This section is for comments related to problems with the blacklist (such as incorrect syntax or entries not being blocked), or problems saving a page because of a blacklisted link. This is not the section to request that an entry be unlisted (see Proposed removals above).

I can't even find settv.com.tw in the blacklist, but I was blocked for adding it to a page related to Taiwanese drama on wikipedia.--User:mandy2701 19:58, 14 May 2008 (UTC)Reply

It isn't blacklisted on Meta (nor on en.wp) — VasilievV 2 19:06, 14 May 2008 (UTC)Reply

All of Nakon's spam blacklist entries referred to this subpage of his, not to Spam blacklist/Log.

Nakon has now deleted his subpage so I have recreated it at:

--A. B. (talk) 03:49, 14 May 2008 (UTC)Reply

~ender

When saving an edit, a blacklist page pops up, and there is no clear way to return to your edit to attempt to fix whatever the issue is. This results in a loss of work if you were working online in your browser window (instead of offline). This should be fixed.
~ender 2008-05-18 10:47:AM MST

If I am correct, this depends on your browser; some browsers indeed do not allow you to return to the edit window. There is nothing the Spam blacklist or the Meta people can do about this; I am afraid you will have to go to bugzilla for this, and/or poke one or more of the MediaWiki software developers. Sorry. --Dirk Beetstra T C (en: U, T) 18:36, 18 May 2008 (UTC)Reply

Discussion

Looking ahead

"Not dealing with a crisis that can be foreseen is bad management"

The Spam blacklist is now hitting 120K & rising quite fast. The log page started playing up at about 150K. What are our options looking ahead I wonder. Obviously someone with dev knowledge connections would be good to hear from. Thanks --Herby talk thyme 10:46, 20 April 2008 (UTC)Reply

I believe that the extension is capable of taking a blacklist from any page (that is, the location is configurable, and multiple locations are possible). We could perhaps split the blacklist itself into several smaller lists. I'm not sure there's any similarly easy suggestion for the log though. If we split it up into a log for each of several blacklist pages, we wouldn't have a single, central place to look for that information. I suppose a search tool could be written to find the log entries for a particular entry. – Mike.lifeguard | @en.wb 12:24, 20 April 2008 (UTC)Reply
What exactly are the problems with having a large blacklist? --Erwin(85) 12:34, 20 April 2008 (UTC)Reply
It's just the sheer size of it; at a certain moment it takes long to load, to search, etc. The above suggestion may make sense: smaller blacklists per month, transcluded into the top level? --Dirk Beetstra T C (en: U, T) 13:16, 20 April 2008 (UTC)Reply
Not a technical person but the log page became very difficult to use at 150K. Equally the page is getting slower to load. As I say - not a techy - but my ideal would probably be "current BL" (6 months say) & before that? --Herby talk thyme 13:37, 20 April 2008 (UTC)Reply
I don't know how smart attempting to transclude them is... The spam blacklist is technically "experimental" (which sounds more scary than it really is) so it may not work properly. I meant we can have several pages, all of which are spam blacklists. You can have as many as you want, and they can technically be any page on the wiki (actually, anywhere on the web that is accessible) provided the page follows the correct format. So we can have one for each year, and just request that it be added to the configuration file every year, which will make the sysadmins ecstatic, I'm sure :P OTOH, if someone gives us the go-ahead for transclusion, then that'd be ok too. – Mike.lifeguard | @en.wb 22:12, 20 April 2008 (UTC)Reply
A much better idea: bugzilla:13805 bugzilla:4459 ! – Mike.lifeguard | @en.wb 01:43, 21 April 2008 (UTC)Reply
Just to note that my browser will no longer render the spam blacklist properly (though it's all there in edit view) - this is a real problem! – Mike.lifeguard | @en.wb 18:58, 14 May 2008 (UTC)Reply
Still there for me but "I told you so" :) By the time I'm back you will have got it all sorted out...... --Herby talk thyme 19:06, 14 May 2008 (UTC)Reply
My suggestion until then is to add another Spam blacklist 2 (needs configuration change). When logging, make sure you say which one you're adding to. We should possibly also split the current one in half so it will be easier to search and load. Configuration would look something like:
	$wgSpamBlacklistFiles = array(
		"DB: metawiki Spam_blacklist",   // the current one
		"DB: metawiki Spam_blacklist_2", // the new one
		"DB: metawiki Spam_blacklist_3"  // we can even have them configure an extra so that when #2 gets full-ish, we can just start on #3
	);
If we want to do this, it is an easy configuration change - took me <30s on my wiki. Using multiple blacklists is a pretty good solution until we can get a special page to manage the blacklist. The only downside I can see is you'd have to search more than one page to find a blacklist entry (but at least the page will load and render properly!) – Mike.lifeguard | @en.wb 16:54, 15 May 2008 (UTC)Reply
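
As an aside, here is a minimal PHP sketch of why splitting the list should be behaviour-neutral (illustrative only: the entries are made up, and the SpamBlacklist extension's real loading code differs): entries from every configured source are effectively combined before matching, so a URL blocked by any one list is still blocked.

<?php
// Entries gathered from each configured blacklist source (illustrative).
$lists = array(
    array('\bexample\.com\b', '\bspam-site\.org\b'), // Spam_blacklist
    array('\bnew-spam\.net\b'),                      // Spam_blacklist_2
);
// Combine all entries into one alternation, as if they were a single list.
$entries = array();
foreach ($lists as $list) {
    $entries = array_merge($entries, $list);
}
$combined = '/' . implode('|', $entries) . '/i';
// A URL blocked by any of the source lists is still blocked.
var_dump(preg_match($combined, 'http://new-spam.net/page')); // int(1)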

I'm thinking of writing a new extension which works based on a real interface, and allows much better management. Werdna 08:15, 16 May 2008 (UTC)Reply

Until then, I would suggest going with Mike.lifeguard's suggestion. What about splitting off the old part, or the unlogged part, into an 'old' spam blacklist, and working with the normal, current blacklist for the active cases? In that case there is no confusion about where to add, and things render properly. The old one is only edited when deleting an entry. --Dirk Beetstra T C (en: U, T) 08:51, 16 May 2008 (UTC)Reply
What is the 'old' part? I think it's a good idea having one active list. In any case there should be some logic in splitting the lists, so choosing what list to add to won't be arbitrary. --Erwin(85) 11:17, 16 May 2008 (UTC)Reply
It is a good idea, but only if it works. Right now it doesn't work for me, so I see this as a problem that needs fixing sooner rather than later. I think Beetstra's method of splitting it up would be fine. – Mike.lifeguard | @en.wb 17:20, 16 May 2008 (UTC)Reply

Feedback on logging & people who "help" please

I know this is not the most popular place in the world to work & I am grateful for almost anyone helping out. However we do have one or two people who are quite intransigent about following policy/practice as far as logging is concerned.

Nakon shows pretty much indifference to working with practice in a number of ways. They prefer their own logging method, and their regex is incomplete so it catches innocent sites (I've fixed it). As far as the bot reports they tackled are concerned, no comment was given as to why they were either listed or closed without listing, and quite a few have been removed or re-opened by Drini, Dirk or myself.

Equally Raul654 seems unconcerned, leaving others to sort out any problems caused (& I am not sure they actually understand that this blacklist is only for cross-wiki issues).

As I pointed out to Majorly today, if we are preventing a site from being included in something approaching 30,000 wikis then I think our grounds need to be valid & visible except in emergencies (in this case it is merely a misunderstanding of how logging works I think). Otherwise the Foundation could genuinely face some heavy questioning.

Am I getting this completely wrong? If so, do say; if not, how do we deal with such "help"? Thanks --Herby talk thyme 09:02, 30 April 2008 (UTC)Reply

One of the reasons I avoid this area like the plague is because it is a pretty vile area to work in. Not only is it full of drama and various conflicts and issues, the whole thing is really complex to log, archive etc, without Herby questioning your every move on your talk page (including asking why I blacklisted a well known Wikipedia stalker's website...). I hadn't logged stuff before, and when I did for my latest addition, I followed the format of the entry above me, which inevitably was wrong. I never follow requests from the talk page, so providing diff links would be impossible. I tend to only add stuff asked elsewhere, on IRC or whatever. To me, it was unclear how to format the log (I was unaware they were in groups), and have received no help on this issue ("just log" isn't help). I'd like to help out when I can, but seeing as it appears to have annoyed one of the most active admins on this page, I'll stop helping. Majorly (talk) 09:22, 30 April 2008 (UTC)Reply
30,000 wikis? Are you sure about that? Majorly (talk) 09:25, 30 April 2008 (UTC)Reply
I have pointed out to Majorly that this is not about him. I realised that the logging instructions might not be clear. All help is welcome - however we must respect the fact that if it is not done as well as we can we cause issues for others who follow. As to the number affected A. B.'s comments here are his standard ones! --Herby talk thyme 09:41, 30 April 2008 (UTC)Reply
Well you learn something new everyday! I had thought it was only Wikimedia wikis that were affected. Still, we shouldn't be responsible for every site that isn't in the scope of our project, imo. Majorly (talk) 12:03, 30 April 2008 (UTC)Reply
Regarding logging, though I log everything, the reason that I provide varies. On en.wikipedia I blacklist/RevertList (the latter for XLinkBot) with, as proof, a Special:Contributions of a user or one of the bot reports. Those items are clear enough. If needed I create a 'request' on the request list, which I close immediately, and link to that. For me, the proof has to be enough; it does not have to be complete. I do believe it is important that the logging is done in the appropriate section (current month in the /log), and that there is a link to any form of proof (COIBot, SpamReportBot, special contributions, luxo, a request, whatever). Items on the blacklist that do not have a link to 'proof' should be removed immediately until proof has been provided. No excuses for that.
Regarding the formatting, practically all rules, except the more complex ones, should be encapsulated in '\b' tags. So example.com becomes \bexample\.com\b (also escape the period; while not always strictly necessary, an unescaped period matches any character and can cause false positives in some obscure cases, e.g. 'viacom' would be caught by '\bvi.com\b'). --Dirk Beetstra T C (en: U, T) 10:31, 30 April 2008 (UTC)Reply
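
To make the escaping pitfall concrete, here is a minimal PHP sketch using preg_match (illustrative only; the extension combines entries into larger regexes rather than testing them one by one):

<?php
// \bexample\.com\b matches only the literal domain:
var_dump(preg_match('/\bexample\.com\b/', 'http://www.example.com/page')); // int(1)
var_dump(preg_match('/\bexample\.com\b/', 'http://www.examplexcom.org/')); // int(0)

// An unescaped period matches any character, so '\bvi.com\b'
// also catches the unrelated 'viacom':
var_dump(preg_match('/\bvi.com\b/', 'viacom'));  // int(1)
var_dump(preg_match('/\bvi\.com\b/', 'viacom')); // int(0)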


Hi, just my few thoughts on this. I don't think working here is 'vile' or something because of the logging, which is the easiest part of everything. There is even a nice snippet with a link which makes logging very easy. The most time-consuming part is checking the links to see if it is a crosswiki problem, etc., which probably keeps people away as soon as they realize that this actually is hard work.
I think logging should be a matter of course, I don't see any reason to question its need.
The only thing that I never log is an emergency addition that I remove later (tell me if I should); if I don't remove it, I add something to the talk and log it.
While at it, thanks a lot to all that keep this list running!
Best regards, --birdy geimfyglið (:> )=| 11:53, 30 April 2008 (UTC)Reply
I've worked closely with this list, first as just an editor, more recently as an admin, over about 18 months and about 1000 edits. Everything Herby has written is absolutely right (except possibly my guesstimate of the non-Wikimedia wikis affected that he quotes; all we know is that it's all Wikia plus thousands and thousands of others). Failure to properly log blacklist entries today creates a lot of work for others in the future. I've wasted half an hour or more stepping through hundreds of edits in the blacklist's edit history to find out who blacklisted something and why because they didn't bother to properly log an addition. These searches come in response to whitelist requests as well as problems our regular editors have adding unrelated links because of some glitch in the regex.
I encourage meta admins to follow Herby's instructions on this. If you find work here "vile", then just list the domains on the talk page and let other admins handle it. If you are going to blacklist a domain, then please take the time to do it right, including logging. Normally there should be a talk page entry as well that gives edit histories and/or diffs.
It's unfortunate if someone screws this up out of ignorance while trying to be helpful but we are all human and I've made plenty of mistakes. It's less acceptable to snap at Herby or others if they point out your mistake and ask you to fix it. It's downright uncollegial, unhelpful and arrogant to deliberately ignore the necessary processes here once they're pointed out. It's just creating problems for others. I hate to say this, but I don't think someone with such an attitude should be an admin here. --A. B. (talk) 19:29, 30 April 2008 (UTC)Reply
I'll also add that an "IRC request" is an inadequate justification for blacklisting, given the non-transparent nature of IRC and the lack of any record. I love Gladys Knight, but "I Heard It Through the Grapevine" is just a song, not a sound basis for blacklisting for more than a few minutes until a more complete justification can be recorded with cross-wiki diffs.
Links to "attack sites" can be controversial and those domains are best listed here on the talk page for discussion and consensus-building before blacklisting. Otherwise, we subsequently get disputes over "en.wikipedia imperialism" (or es.wikipedia or whatever); the link either gets removed here or else other projects just whitelist it locally. If something is not classic spam, it pays to build consensus here first. For instance, wikipedia-watch.org is blacklisted here but used on multiple other projects.
--A. B. (talk) 20:04, 30 April 2008 (UTC)Reply
In that case, I have made up my mind for sure. I was asked over Skype to add the website of a long-term Wikipedia stalker (has driven several female admins off the project). Even logging it will cause issues, as it draws attention to this user, which is the opposite of what we want. I got asked over IRC yesterday to add a URL that was appearing in HAGGER page moves. Again, absolutely no reason this will ever need removing, and again, logging it just draws attention to the troll behind it. I am not going to put up with doing this anymore if people are unreasonably demanding I log things which really should not be logged at all, and stating IRC request isn't enough. Of course it's enough - I gave a reason for the addition, which is precisely the same as adding it on the page - just wastes more time doing it that way. Anyhow, after an extremely brief time adding stuff to this page, I have decided I will no longer add stuff to this list, as it only causes issues and more problems that it's really worth. Thanks, Majorly (talk) 20:58, 30 April 2008 (UTC)Reply
I have removed all of my contributions to this list. If you're going to insist on following some confusing policy, I'm not going to deal with it any more. Nakon 21:06, 30 April 2008 (UTC)Reply
A link never needing removal doesn't mean that its removal will not be requested. Not being able to find any reason why it was added to the blacklist might lead to the link being removed. --Jorunn 21:13, 30 April 2008 (UTC)Reply
Trust me, the stuff I add here will never need removing. I avoid this page, and only add stuff if people specifically ask me - and it's usually really bad stuff. Majorly (talk) 21:19, 30 April 2008 (UTC)Reply
@Nakon: >delisting every one of mine[21] As soon as You press enter You release Your contributions into GFDL. Where do You have the additions from that You added, from reports of bots/users? If so, removing is a disruption of this list and a destruction of their requests and work on this (some users work really hard to fight spam cross-wiki and to fill out very long reports; in my opinion such action is quite disrespectful). Thanks, --birdy geimfyglið (:> )=| 21:40, 30 April 2008 (UTC)Reply

Re Majorly. I can understand that there are cases where you don't want an entry tied to a user, diff, or actions. In that case I would make a clear entry in the log, that you are the person adding it, stating some reason, and e.g. adding a permanent contact address where you can be contacted (if you decide in ## years to leave, these things may still need to be sorted out afterwards). It may even be logged to a private email sent to the foundation, which does not have to be disclosed to anyone here, as long as the foundation follows up on it when necessary. I think that would be a good solution. --Dirk Beetstra T C (en: U, T) 09:14, 1 May 2008 (UTC)Reply


good way to prove a point!

Nakon's last entry to the blacklist was to block the entire interfree.it domain - the rationale: "IRC request from betacommand". Now there is an appeal above (here) from what I imagine was not the subdomain it was planned to block. The point is "how the hell do we know"! I've de-activated it for now - I guess someone should find out what was intended though frankly I cannot be bothered.

Yes I could be wrong - how the hell would I know & - Yes I am angry --Herby talk thyme 11:53, 1 May 2008 (UTC)Reply

Herby, please assume good faith. How about asking Nakon why it was added instead of cursing here on the talk page about it? Majorly (talk) 12:09, 1 May 2008 (UTC)Reply
Indeed - however previous enquiries on his talk page have met with no success (indeed I see he has now blanked it) and he seems to have detached himself from working here. As such those who do work here are left to deal with the results of other people's work (as I have been doing for some time).
I am certainly prepared to assume good faith as I imagine you might be. However above this section some people are saying that the logging is pointless, useless, unnecessary etc so it would be nice if they all assumed good faith with me & communicated in a constructive & pleasant manner.
Equally (I hope) it does rather prove the point about how logging with full rationale is rather less than optional & is actually vital to those who work on this page. Thanks --Herby talk thyme 12:43, 1 May 2008 (UTC)Reply
His stubborn attitude doesn't help (he removed all his entries from the list last night, then blanked his talk page). However, I don't suppose he'll be adding much else if he's detaching himself from Meta. If I ever add stuff again (which I doubt), I'll log it, but I think the fuss being made here is a little over the top imo. Especially when the stuff I add is stuff that should never be removed. Majorly (talk) 13:08, 1 May 2008 (UTC)Reply

(editconflict)

Please don't ask this of Herby. I think he assumed good faith, explaining and asking everyone kindly to just log (his approach has just been blanked from the talk page, so I understand that he does not feel welcome or expect an answer on this talk page), and I understand his frustration perfectly well.
Why, in the case of such removal requests, should it be done by asking the sysop who added it, instead of just checking it out with a single click?
That could in some cases also mean:
  • going through the version history, finding out who added it,
  • dropping the one who added it a message on his talk and
  • waiting for his answer - if there ever is one
What if the sysop does not remember why, because it was a year ago, or if (which is not so impossible) he just left the project?
Instead of just copying this snippet into a page, and that would be it.
This is imho just stubbornness that is causing only redundant work and wasting people's time. Thanks, --birdy geimfyglið (:> )=| 12:54, 1 May 2008 (UTC)Reply
I'm not sure what the problem is, since logging is not that hard to do. I agree mostly with Herby; I do think things have been done a bit harshly the last couple of weeks, and logging will prevent future discord. I barely edit the spam blacklist since I'm on dial-up and that page is just too big for my browser (internet connection) to handle, but I have added some entries, and even if it takes me 10 minutes to add to the blacklist I still do it happily since, as pointed out above, nearly 30,000 wikis share our spam blacklist and we must make sure that we think of them as well. Logging isn't really a problem, even if some tried to do it in a different way. Meta is about coordination, and it will be really nice if everything we do is well coordinated and not lying all over the place...--Cometstyles 13:05, 1 May 2008 (UTC)Reply
I think it is abundantly clear to all that help is not helpful unless we're all working from the same page. Logging is not optional. Please be considerate of those you are working with. Logging might not be policy, but we should all respect standard practice. It's there with good reason; much work is created when things are done improperly. Hopefully bugzilla:13805 will help in this respect, but it is not coming tomorrow. So until something like this is implemented, the current system of logging requests is the only way we can all cooperate effectively on this task. Though w:WP:POINT is only policy at enwiki, this is still unacceptable behaviour. If you are not going to cooperate and collaborate on this with everyone else, then at least do not hinder our efforts. – Mike.lifeguard | @en.wb 15:00, 1 May 2008 (UTC)Reply
Actually WP:POINT seems to be found on all the big Wikipedias (except Volapük):
--A. B. (talk) 15:19, 1 May 2008 (UTC)Reply
PS: I wonder if this means the Volapüktians are especially tranquil and don't need such a rule or … very disruptive and don't want a rule like that? They bear watching.

What about a tool that once a day analyzes the whole spam blacklist history and dumps the author of every line? — VasilievVV 15:23, 1 May 2008 (UTC)Reply

Useful, but we also need the reason each line (or set of lines) was added. – Mike.lifeguard | @en.wb 15:39, 1 May 2008 (UTC)Reply
Well, VasilievVV's tool would be a start in terms of sorting out the old entries -- at least we'd know whom to contact -- that's if they're still editing here and if they can remember the details. We still need a regularly maintained log, however. --A. B. (talk) 15:44, 1 May 2008 (UTC)Reply
Here's an example that originated today on en.wikipedia of an unlogged Meta blacklisting: Talk:Spam blacklist#cabinda.net query. Anyone care to track this one down -- who added it? why was it added? was the blacklisting justified then? is it still justified? Or do we remove it and move on? --A. B. (talk) 15:49, 1 May 2008 (UTC)Reply

I've done it: Spam blacklist/Blame — VasilievVV 16:00, 6 May 2008 (UTC)Reply

  • Cool. I'd like to speak up in Herby's defence here, since I work both sides of this equation - requests for clarification as to why something was blacklisted, via OTRS, and requests for blacklisting. It is massively easier to authoritatively answer an OTRS or other complaint from a site owner if we have a proper log entry that you can actually find. I know I sometimes forget as well, I don't do bureaucracy well, but proper logging is an essential part of our public-facing duty. This is not like a single project blacklist where issues are localised and easily fixed, meta is a very small project but with very big impact. Just think of it as change control and learn to love it for the once in a lifetime that it digs you out of a hole. JzG 13:57, 9 May 2008 (UTC)Reply
re: Spam blacklist/Blame -- VasilievVV, is this a list of unlogged blacklistings? If so, I've been a very bad boy and will start working on my share. Thanks for putting this together. JzG is so right in what he says. --A. B. (talk) 15:28, 9 May 2008 (UTC)Reply
I dumped the history of the spam blacklist, parsed the current version, and looked through all of the history to find the revision in which each entry was added. The parsing had some problems caused by several blacklist blankings — VasilievVV 15:31, 9 May 2008 (UTC)Reply
I am sure there are things wrong with /Blame (or I misunderstand the list). I may have forgotten an odd log addition, but by far not as many as what is written there. I presume this is a blame list which also contains the things that were logged (a quick check .. I did log \bpetrsoukal\.php5\.cz\b somewhere in April, yet it is still in this list). --Dirk Beetstra T C (en: U, T) 18:36, 9 May 2008 (UTC)Reply
You misunderstood me. It's for all log entries — VasilievVV 18:37, 9 May 2008 (UTC)Reply
Would it be doable to parse this list against the log, and create from the difference a 'nonlogged' (or 'wrongly' logged) list? For those items where we still do remember why it was blacklisted we could still add an entry in the log. --Dirk Beetstra T C (en: U, T) 18:41, 9 May 2008 (UTC)Reply
Done, see Spam blacklist/Blame/Unlogged
Thanks very, very much, VasilievVV. Sadly, I see that I've got a big listing to fix on that list and I will take care of it.
In the interests of working on the most unlogged domains with the least investigatory effort, here's a list I compiled:
--A. B. (talk) 17:23, 10 May 2008 (UTC)Reply
This is odd. I went to log my edit r885352 which appears on the list at Spam blacklist/Blame/Unlogged. Yet, I found that it was already logged. I made a typographical error when logging my action (one entry was "\dalekozor\.com\b") which I did not make with the actual blacklist entry ("\bdalekozor\.com\b"). Perhaps this mistake caused the anomaly. --A. B. (talk) 17:30, 10 May 2008 (UTC)Reply
It goes through every spam blacklist entry, and then checks if it is in the log — VasilievVV 17:36, 10 May 2008 (UTC)Reply

LinkWatchers

I am working on both loading the old database (about 4 months' worth of links) and rebuilding the current database (about 5-6 weeks' worth of links) into a new database.

  • The old database is in an old format, and has to be completely reparsed..
  • The new database had a few 'errors' in it, and I am adding two new fields.

I am running through the old databases by username, starting with aaa.

As a result the new database does not contain too much data yet, and will be 'biased' towards usernames early in the alphabet.

This process may take quite some time, maybe weeks, as I have to throttle the conversion to keep the current linkwatchers 'happy' (they are still running in real-time). These linkwatchers are also putting their data into this new database, so everything after about 18:00, April 29, 2008 (UTC) is correct and complete.

The new database contains the following data; I will work later on making it more accessible for on-wiki research:

  1. timestamp - time when stored
  2. edit_id - service field
  3. lang - lang of wiki
  4. pagename - pagename
  5. namespace - namespace
  6. diff - link to diff
  7. revid - the revid, if known
  8. oldid - the oldid, if any
  9. wikidomain - the wikidomain
  10. user - the username
  11. fullurl - the full url that was added
  12. domain - domain, indexed and stripped of 'www.' -> www.example.com becomes com.example.
  13. indexedlink - rewrite of the fullurl, www.example.com/here becomes com.example./here (see the sketch after this list)
  14. resolved - the IP for the domain (new field, and if found)
  15. is it an ip - is the edit performed by an IP (new field)
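
To make fields 12 and 13 concrete, here is a small PHP sketch of the reversed-domain rewrite described above (the function name is illustrative; this is not the bot's actual code):

<?php
// Rewrite a host into the indexed form: strip 'www.' and reverse the labels.
function indexedDomain($host) {
    $host = preg_replace('/^www\./', '', strtolower($host));
    return implode('.', array_reverse(explode('.', $host))) . '.';
}
echo indexedDomain('www.example.com') . "\n";   // com.example.
echo indexedDomain('sub.example.co.uk') . "\n"; // uk.co.example.sub.

Reversing the labels means every subdomain of example.com shares the prefix com.example., so an indexed prefix query can find a domain together with all of its subdomains quickly.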

I'll keep you posted. --Dirk Beetstra T C (en: U, T) 10:31, 30 April 2008 (UTC)Reply

Well .. keep people posted:

We had two different tables, linkwatcher_linklog and linkwatcher_log. The former is in a reasonably new format, the latter in a very old, outdated format.

The table linkwatcher_linklog is being transferred into linkwatcher_newlinklog, and when a record is converted, it is moved to a backup table (linkwatcher_linklogbackup). That conversion is at about 31% now.

The table linkwatcher_log is being completely reparsed, and when a record is converted, the record is transferred into linkwatcher_logbackup. For this table the conversion is also at about 29%.

All converted data goes into linkwatcher_newlinklog, as does the data that is currently being recorded by the linkwatcher 'bots'.

  • linkwatcher_linklog - 1,459,158 records - 946.7 MB
  • linkwatcher_linklogbackup - 646,127 records - 323.5 MB
  • linkwatcher_log - 1,526,600 records - 1.0 GB
  • linkwatcher_logbackup - 628,575 records - 322.8 MB
  • linkwatcher_newlinklog - 2,152,052 records - 1.1 GB

Still quite some time to go. The conversion of linkwatcher_linklog is at usernames starting with 'fir', linkwatcher_log is at usernames starting with 'jbo'. --Dirk Beetstra T C (en: U, T) 20:54, 9 May 2008 (UTC)Reply

Update:

  • linkwatcher_linklog - 761,429 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,036,146 records - 525.7 MB (58% converted)
  • linkwatcher_log - 1,159,216 records - 1.0 GB
  • linkwatcher_logbackup - 995,959 records - 501.8 MB (46% converted)
  • linkwatcher_newlinklog - 3,448,605 records - 1.8 GB

linklog is at "OS2", log is at "Par". --Dirk Beetstra T C (en: U, T) 15:01, 16 May 2008 (UTC)Reply

Update (the first one is starting to convert IPs):

  • linkwatcher_linklog - 303,562 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,494,013 records - 754.8 MB (83% converted)
  • linkwatcher_log - 732,773 records - 1.0 GB
  • linkwatcher_logbackup - 1,422,402 records - 711.0 MB (66% converted)
  • linkwatcher_newlinklog - 5,062,298 - 2.6 GB

linklog is at '199', log is at 'xli' (had to take one down for some time, too much work for the box). --Dirk Beetstra T C (en: U, T) 19:19, 25 May 2008 (UTC)Reply

Excluding our work from search engines

This is a bigger problem for enwiki than for us, but still... I'd like to ask that subpages of this page be excluded from indexing via robots.txt so we do not receive complaints about "You're publicly accusing us of spamming!" and the like. These normally end up in OTRS, where it is a waste of volunteers' time and energy. The MediaWiki search function is now good enough that we can use it to search this site for a domain rather than relying on a google search of meta (ie in {{LinkSummary}} etc). As well, we'll include the subpages for COIBot and LinkReportBot reports. – Mike.lifeguard | @en.wb 20:05, 10 May 2008 (UTC)Reply
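
For illustration, the exclusion might look something like the following robots.txt fragment (a sketch only: the paths are hypothetical, and the real rules would depend on how the bug is resolved):

User-agent: *
Disallow: /wiki/Talk:Spam_blacklist/
Disallow: /wiki/User:COIBot/
Disallow: /wiki/User:SpamReportBot/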

I have made a bug for this: bugzilla:14076. – Mike.lifeguard | @en.wb 20:10, 10 May 2008 (UTC)Reply
Good idea. There's no need for these pages to be indexed. --Erwin(85) 08:06, 12 May 2008 (UTC)Reply
Mike, I strongly agree with excluding crawlers from our bot pages.
I very much disagree, however, with excluding crawlers from this page and its archives. For at least half of our hard-core spammers, seeing their name in a Google search is the first time they finally take us seriously. Since they usually have other domains we're unaware of, this deters further spam.
If domain-owners feel wronged about entries on this page, overworked OTRS volunteers should feel free to direct them to the removals section here and we can investigate. In my experience, many Wikipedia admins and editors don't have enough experience with spam to know how to investigate removal requests and separate the sheep from the goats.
If there's been a false report, we can move the entries from our crawlable talk archives to a non-crawlable subpage (call it "false positives").
I'll also note that I think we've been getting more false positives blacklisted since we got these bot reports. I continue to feel strongly that we must be very conservative in blacklisting based on bot reports. If a site's been spammed, but it looks useful, wait until some project complains or blacklists it. Even if a site's been spammed and doesn't look useful, if the spammer hasn't gotten enough warnings then we shouldn't blacklist it unless we know he fully understands our rules -- that, or we get a complaint from one of our projects. Perhaps we should have our bots issue multilingual warnings in these cases.
As for the spammer that's truly spammed us in spite of 4 or more warnings, I don't care if he likes being reported as a spammer or not. I've spent almost two years dealing with this subset of spammers and they're going to be unhappy with us no matter what we do until they can get their links in. --A. B. (talk) 13:39, 12 May 2008 (UTC)Reply
I agree with you that the bot reports probably have a threshold that is too low - perhaps that can be changed. Until such time, they need to be handled carefully.
The problem I am talking about is not false positives. Those are very straightforwardly dealt with. The real problem is not with people emailing us to ask to get domains de-listed, but rather with people emailing us demanding that we stop "libeling" them. What they want is for us to remove all references to their domain so their domain doesn't appear in search results next to the word "spam". Well, I'm not prepared to start blanking parts of the archives to make them stop whining - we need these reports for site maintenance. Instead we can have our cake and eat it too: don't allow the pages to be indexed, and keep the reports as-is. I'm not sure I see how having these pages indexed deters spammers. What I do see is lots of wasted time dealing with frivolous requests, and a way to fix that.
Just a reminder to folks that we should discuss this here, not in bugzilla. I am closing the bug as "later" which I thought I had done earlier. Bugzilla is for technical implementation (which is straightforward); this space is for discussion. They will yell at us if we spam them by discussing on bugzilla :) – Mike.lifeguard | @en.wb 16:16, 12 May 2008 (UTC)Reply
If a site-owner has truly spammed us in spite of repeated warnings, then we are not libeling them if search engines pick up their listings here. I've dealt with such complaints before; I point out the clear evidence that's a matter of public record: warnings, diffs and our rules. I tell them if they find any factual inaccuracies in that record to let us know and we'll fix it immediately. I'm happy to discuss these blacklisting decisions with site-owners that bring them to the removals section.
Wikimedia's organization and servers are based in the United States; libel cases there are very difficult to pursue. A fundamental tenet there is that truth is an absolute defense. If our records are true and spammers have been previously warned and apprised of our rules, then they don't have a leg to stand on in that jurisdiction.
Servers may be based in the US, but they are accessible worldwide. Thus you can be liable in other courts, like say the UK (unless wikia's servers are in NY[22], whose law is questionable at best). In the UK just getting your case defended will cost you $200,000-plus up front, and much more if you lose - and if you are not represented, judgment will be entered against wikia in default. [23]
~ender 2008-05-18 11:04:AM MST
As for deterrence, periodically perusing hard core spammer forums like seoblackhat.com and syndk8.net as well as more general SEO forums like forums.digitalpoint.com will show lively discussions as to whether spamming us is worth the risks and aggravation. The more negative the chatter there about "wiki link-nazis", the better off our projects are.
My sense is that the volume of complaints has grown recently as we've blacklisted more domains based on bot reports. Some of these site-owners may not be truly innocent but they're not hard-core and haven't been warned sufficiently. Blacklisting comes as a large, alarming shock to them.
--A. B. (talk) 17:03, 12 May 2008 (UTC)Reply
Of course it's not real libel. But that doesn't stop them from complaining, which is a waste of time. And a needless waste of time when it can be so easily stopped. On the other hand, I do see your point about being perceived as the link Nazis. Perhaps someone other than the three of us can share some thoughts?
Also, you're assuming there's a presumption of innocence. In other legal jurisdictions (and for a number of things in the US) that is not the case. You will have to prove that it is not libel.
~ender 2008-05-18 11:09:AM MST
I agree with excluding these pages from search engines. While Meta is nowhere near as high on searches as enwiki, it won't take much. Cary Bass demandez 23:18, 13 May 2008 (UTC)Reply

I also agree, although I do believe that the fact that these reports rank so high has a preventive effect: finding these reports about other companies should deter companies from doing the same. But given the negative impact these reports may have on a company (although that is also not our responsibility; when editing Wikipedia they get warned often enough that we are not a vehicle for advertising!), I think it is better to hide them, especially since our bots are not perfect and sometimes pick up links wrongly, and we are only human and may make mistakes in reporting here as well. --Dirk Beetstra T C (en: U, T) 09:45, 14 May 2008 (UTC)Reply

Dirk, I think the way to "have our cake and eat it" is to have this talk page and its archives crawlable and just exclude the bot pages. As for mistakes here, I don't see many true mistakes in human-submitted requests that actually get blacklisted. I at least skim almost all the blacklist removal and whitelist requests both here and on en.wikipedia. The most common mistake humans make is to unknowingly exclude all of a large hosting service (such as narod.ru) instead of just the offending subdomain; I have yet to see a large hosting service complain. Otherwise, >>90% of our human-submitted requests nowadays have been pretty well thrashed out on other projects first before they even get here. Of those that are still flawed, they either get fixed or rejected with a public discussion. It's not that humans are so much smarter than bots, but rather, like so many other things on wiki, after multiple edits and human editors, the final human-produced product is very reliable.
I spend several hours a month reading posts on several closed "black hat" SEO forums I'm a "member" of. The reliability, finality and public credibility of our spam blacklisting process bothers a lot of black hats. I think our goal should be to keep this going. --A. B. (talk) 12:30, 14 May 2008 (UTC)Reply
The point isn't whether we're right or wrong. The point is whether they complain and waste our time or not. It seems to me that looking like link Nazis publicly is not a very strong rationale if it conflicts with the goal of doing the work without wasted time. That said, views on this may differ. – Mike.lifeguard | @en.wb 16:01, 14 May 2008 (UTC)Reply
Would it be too radical if we asked for a way to choose, on-wiki, which pages shouldn't be indexed? An extension can easily be created and installed for Meta which lets us add a <noindex /> tag to the top of the article, and cause an appropriate "noindex" meta tag to be added to the HTML output, thus preventing that page (and only that page) from being indexed. How do you feel about my raw idea? Is it too prone to being abused? Huji 13:38, 15 May 2008 (UTC)Reply
Sounds good, but that would, if enabled on en, give strange vandalism I am afraid. I would be more inclined towards an extension which listens to a MediaWiki page where pages can be listed which should not be indexed (that should include their subpages as well). I don't know how difficult that would be to create. --Dirk Beetstra T C (en: U, T) 16:25, 15 May 2008 (UTC)Reply

Hi, why have you removed links from this article? And why have you blacklisted allaahuakbar.net & asipak.com? My ID on en wikipedia is asikhi--203.81.214.68 06:49, 12 May 2008 (UTC)Reply

Please see Talk:Spam_blacklist#Six_links_from_cross_wiki_for_discussion. --Dirk Beetstra T C (en: U, T) 09:17, 12 May 2008 (UTC)Reply

For info & comments: Since this bug was done in rev:34769, we can possibly be more aggressive in blacklisting domains. The new behaviour is that the page can be saved if the URL was present in the previous revision - so the spam blacklist will stop only new spam - past spam must be handled separately as it will stay there until removed (and you won't be stopped from saving the page). – Mike.lifeguard | @en.wb 03:50, 14 May 2008 (UTC)Reply
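
In other words, only the links an edit adds are checked against the blacklist. A simplified PHP sketch of that behaviour (names are illustrative; this is not the extension's actual code):

<?php
// Only links added by the edit are tested against the blacklist regex.
function editIsBlocked($blacklistRegex, $oldLinks, $newLinks) {
    foreach (array_diff($newLinks, $oldLinks) as $addedLink) {
        if (preg_match($blacklistRegex, $addedLink)) {
            return true; // a newly added blacklisted link blocks the save
        }
    }
    return false; // pre-existing blacklisted links no longer block saving
}

// A page that already contained a blacklisted link can still be saved:
$old = array('http://spam.example.com/');
$new = array('http://spam.example.com/', 'http://good.example.org/');
var_dump(editIsBlocked('/\bspam\.example\.com\b/', $old, $new)); // bool(false)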

Thanks for the heads up. It's also useful for whitelisting. I whitelisted a domain on nlwiki yesterday because it was used as a source. That kind of forced me to whitelist even though I agreed with blacklisting here. I guess in such a case you don't necessarily have to whitelist now. --Erwin(85) 09:48, 14 May 2008 (UTC)Reply
Would this also mean that the bots can put a working link in their reports, so the reports can be found via a Special:LinkSearch (easy, just adapt the LinkSummary template)? --Dirk Beetstra T C (en: U, T) 14:37, 16 May 2008 (UTC)Reply
I guess so. The links are reported before blacklisting, so in case the address is blacklisted because of the report we can still comment. --Erwin(85) 14:49, 16 May 2008 (UTC)Reply
Let's try .. or will this make our reports increase in google-findability?? --Dirk Beetstra T C (en: U, T) 14:53, 16 May 2008 (UTC)Reply
The fact that it's a link as opposed to plaintext is irrelevant as we have nofollow on WMF sites. Also, it looks like we will be excluding bot reports from indexing altogether. – Mike.lifeguard | @en.wb 17:18, 16 May 2008 (UTC)Reply

Useful investigation tool: url-info.appspot.com

In investigating User:COIBot/XWiki/url-info.appspot.com, I found that the linked site provides a small browser add-on that's a very useful tool for investigating all the links embedded in a page, whether it's a Wikimedia site page or an external site. I recommend others active with spam mitigation add it to their browser toolbar and check it out. This could have saved me many hours in the last year:

If you're trying to find possibly related domain links on a spam site to investigate, this will quickly list them all, sparing the aggravation of clicking on every link. In fact it's so easy to glean information that we'll need to ensure we're not mindlessly reporting unrelated domains as "related" when they've appeared on a spam site page for some innocent reason:

As for the spam report for this domain, I don't think the extent of COI linking (just 1 link to each of 4 projects) currently meets the threshold for meta action; local projects can deal with this as they see fit. The tool is free and the page has no ads.

Note that appspot.com, the underlying main domain, is registered to Google for users of its App Engine development environment. --A. B. (talk) 14:22, 21 May 2008 (UTC)Reply

It could be useful. It simply lists external links though, so like you said I guess most of 'm have nothing to do with the site. Note that the add-on is a w:en:bookmarklet, so not an extension or something. --Erwin(85) 16:06, 21 May 2008 (UTC)Reply

Closing old bot reports

I've been checking some old reports in Category:Open XWiki reports and most link additions were a few weeks or months old. I haven't come across a url that I think should be blacklisted, so I'm wondering if anyone has objections to closing all reports that are over a month old and have fewer than, say, 3 edits. I'm not suggesting using a bot, but I am suggesting closing them without checking diffs. --Erwin(85) 16:06, 21 May 2008 (UTC)Reply

I would say, do as you see fit. The reports are there; they may contain a couple which are really bad, where we can prevent future trouble; the rest can go. If it reoccurs, we will probably see it, and if it is really bad, add it now. Don't waste too much time on it, and don't worry if you close a bad one without adding .. --Dirk Beetstra T C (en: U, T) 16:15, 21 May 2008 (UTC)Reply
Actually, I was considering using a bot to mass-close them. They are so stale as to be largely useless. The bot will bring back anything needing further attention. I suppose that would run the chance of missing something, but I am not prepared to spend enough time to go through them thoroughly. If you are, feel free.  – Mike.lifeguard | @en.wb 19:18, 21 May 2008 (UTC)Reply
Using a bot would be fine by me as long as it's used for old cases. --Erwin(85) 19:59, 21 May 2008 (UTC)Reply
Perform a close-run on everything with fewer than 5 links, and see what is left over. I am not worrying about them at all, but we may be able to blacklist some real rubbish based on them, so we don't have to do that later and have extra work. But if they all just get closed and then we see the spam again .. they are not in the way; COIBot ignores the SpamReportBot ones anyway. --Dirk Beetstra T C (en: U, T) 09:47, 22 May 2008 (UTC)Reply
I just set up a bot to close the reports. Using the toolserver I created a list of 123 reports with no edits in the last month. From those reports there were 40 with less than 5 links. My bot is closing them now. --Erwin(85) 13:14, 22 May 2008 (UTC)Reply
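
For the record, the selection described above amounts to something like this (a PHP sketch with an illustrative data structure; the actual bot and toolserver query are not shown here):

<?php
// Illustrative report records: page title, last edit time, number of links.
$reports = array(
    array('title' => 'User:COIBot/XWiki/stale-example.com',
          'lastEdit' => strtotime('2008-04-10'), 'links' => 3),
    array('title' => 'User:COIBot/XWiki/busy-example.org',
          'lastEdit' => strtotime('2008-05-20'), 'links' => 12),
);

// Close reports with no edits in the last month and fewer than 5 links.
$cutoff = strtotime('-1 month');
$toClose = array();
foreach ($reports as $report) {
    if ($report['lastEdit'] < $cutoff && $report['links'] < 5) {
        $toClose[] = $report['title']; // only the stale, small report qualifies
    }
}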

Thresholding the xwiki

The linkwatchers now calculate for each link addition the following 4 values (all relative to what is currently in the database):

  1. UserCount - how many external links did this user add
  2. LinkCount - how often is this external link added
  3. UserLinkCount - how often did this user add this link
  4. UserLinkLangCount - to how many wikipedias did this user add this link.

The threshold was first:

if ((($userlinklangcount / $linkcount) > 0.90)  # wikis this user added it to, relative to total additions: >90%
    && ($linkcount > 2)                         # the link was added more than twice overall
    && ($userlinklangcount > 2)) {              # this user added it to more than two wikis
   report
}

I noticed that when one user performs two edits adding the link on the first wiki, and then starts adding it to other wikis as well, the user only gets reported at edit 11, which I found way too late:

  • 3/3
  • 10/11
  • 11/12

earlier/in-between combinations do not pass that threshold (with one duplicated addition the ratio is n/(n+1), which only exceeds 0.90 from 10/11 onward) ..

The code is now:

if ((($userlinkcount / $linkcount) > 0.66)         # this user made >66% of all additions of this link
    && (($userlinklangcount / $linkcount) > 0.66)  # wikis this user added it to, relative to total additions: >66%
    && ($userlinklangcount > 2)) {                 # this user added it to more than two wikis
  report
}

This is (userlink/link & wikis/link):

  • 3/3 & 3/3
  • 3/4 & 3/4
  • 4/5 & 4/5
  • 5/6 & 5/6
  • 5/7 & 5/7
  • 6/8 & 6/8

etc.

I am thinking of also adding something like ($userlinkcount < xxx), which should exclude some more established editors, xxx being .. 100 (we had a case of one editor adding 20 links in one edit .. you need to be hardcore to escape 100, adding 34 links every edit)?

I want to say here: the threshold is low, and maybe it should be. Cleaning 10 wikis when crap is added is quite some work; it is easier to close/ignore one where only 4 wikis were affected. I will let this run and see what happens; this may give a lot more work, in which case I am happy to put the threshold higher.

Comments? --Dirk Beetstra T C (en: U, T) 15:15, 22 May 2008 (UTC)Reply
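
To pull the numbers in this section together, here is a sketch of the threshold with the proposed cap added, in the same style as the snippets above (the value 100 is only the one floated here, and which counter best identifies 'established editors' is exactly the open question):

if ((($userlinkcount / $linkcount) > 0.66)         # this user made >66% of all additions of this link
    && (($userlinklangcount / $linkcount) > 0.66)  # wikis this user added it to, relative to total additions: >66%
    && ($userlinklangcount > 2)                    # this user added it to more than two wikis
    && ($userlinkcount < 100)) {                   # proposed: skip very-high-volume (established) editors
  report
}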