Talk:Spam blacklist

From Meta, a Wikimedia project coordination wiki
Revision as of 16:35, 7 September 2008

Shortcut:
WM:SPAM
The associated page is used by the MediaWiki SpamBlacklist extension, and lists strings of text that may not be used in URLs on any page in Wikimedia Foundation projects (as well as many external wikis). Any Meta administrator can edit the spam blacklist. There is also a more aggressive way to block spamming through direct use of $wgSpamRegex. Only developers can make changes to $wgSpamRegex, and its use is to be avoided whenever possible.

For more information on what the spam blacklist is for, and the processes used here, please see Spam blacklist/About.

Please post comments in the appropriate section below: Proposed additions, Proposed removals, or Troubleshooting and problems; read the message boxes at the top of each section for an explanation. Also, please check back some time after submitting, as there could be questions regarding your request. Per-project whitelists are discussed at MediaWiki talk:Spam-whitelist. In addition, please sign your posts with ~~~~ after your comment. For other discussions related to the blacklist that do not concern a problem with a particular link, see Spam blacklist policy discussion.

Completed requests are archived (list, search); additions and removals are logged.

snippet for logging: {{sbl-log|1169258#{{subst:anchorencode:SectionNameHere}}}}

If you cannot find your remark below, please do a search for the URL in question with this Archive Search tool.

Spam that only affects a single project should go to that project's local blacklist.

Proposed additions

This section is for proposing that a website be blacklisted; add new entries at the bottom of the section, using the basic URL so that there is no link (example.com, not http://www.example.com). Provide links demonstrating widespread spamming by multiple users on multiple wikis. Completed requests will be marked as {{added}} or {{declined}} and archived.

Cosmoetica.com

See [1] and [2], a sockfarm of users all adding links to work by Dan Schneider, mostly from his website cosmoetica.com. 120 links cleaned from enWP and a number from other wikis (I'm on it, but it's slow as I have to cross-check who added them). Where the links are added by anons, it is a stable subnet. Definitely a candidate for blacklisting on enWP and probably a candidate for meta blacklisting due to cross-wiki issues, albeit fairly limited by comparison with the extensive enWP abuse. JzG 21:54, 1 September 2008 (UTC)[reply]

Is this really something we want to be linking to regardless of who added it? Furthermore, it seems from the links you've provided that several domains are involved. Has blacklisting been discussed on enwiki yet?  — Mike.lifeguard | @en.wb 18:22, 2 September 2008 (UTC)[reply]
Yes, it's now blacklisted on enWP (and one of Schneider's socks promptly requested whitelisting). The list of socks is up to about 40 now [3], but primarily an enWP issue. Still, there is some cross-wiki activity and the guy is very determined to use Wikipedia to make his site "The Most Widely Read Interview Series In Internet History!" - to which I cynically respond: {{fact}}. JzG 18:05, 4 September 2008 (UTC)[reply]

sexyunderwear-weddingdress spam

Spam domains

Spam accounts

--A. B. (talk) 21:44, 2 September 2008 (UTC)[reply]

Added Added & thanks again, A. B.  — Mike.lifeguard | @en.wb 00:30, 3 September 2008 (UTC)[reply]


Luna Musik Management, Guzman Construction

Spam domains

Spam account

Alemannic is certainly an odd choice of languages to spam; I suspect he chose it because it was at the top of the list of cross-wiki links for the en:Hip Hop article and then got interrupted.


Reference

--A. B. (talk) 16:22, 3 September 2008 (UTC)[reply]

Added Added. Best to catch things early & these are not useful to our projects.  — Mike.lifeguard | @en.wb 21:22, 3 September 2008 (UTC)[reply]

Franchising spam

{{linksummary|fai.co.in}} {{linksummary|franchise.org.au}} Per luxo:85.121.14.131 and Cross-wiki vandalism report. Perhaps a tad stale. The IP has been reverted since then.  — Mike.lifeguard | talk 14:17, 4 September 2008 (UTC)[reply]

In his contributions I see spamming of

and not the domains you mentioned. Blacklist those? --Erwin(85) 08:40, 5 September 2008 (UTC)[reply]
Yes... Now to find out who is spamming those domains.  — Mike.lifeguard | talk 11:21, 5 September 2008 (UTC)[reply]
Sorry, no useful data in my database .. --Dirk Beetstra T C (en: U, T) 11:24, 5 September 2008 (UTC)[reply]

Added Added. --Erwin(85) 11:39, 6 September 2008 (UTC)[reply]

seoloji.org spam

Spam domains

Spam accounts

--A. B. (talk) 02:48, 5 September 2008 (UTC)[reply]

Added Added. Thanks. --Erwin(85) 08:31, 5 September 2008 (UTC)[reply]


Additional seoloji.org spam

I looked at the spam domains above a bit closer using the WhosOnMyServer tool. Several domains were hosted by commercial web-hosting services and their servers contain hundreds of unrelated domains. I did, however, identify one German server, 89.149.226.124, with a cluster of the spam domains above plus a few other Turkish domains. About half of those domains turned out to be related to the domains I reported above and several had also been spammed:

Also spammed

Other related domains

Additional spam account


--A. B. (talk) 18:25, 5 September 2008 (UTC)[reply]

Added Added the spammed ones.  — Mike.lifeguard | talk 18:03, 6 September 2008 (UTC)[reply]

ara-bux.com

Please see [4]. (Blocked at Commons) Cirt (talk) 09:55, 5 September 2008 (UTC)[reply]

Added Added. Thanks. --Erwin(85) 10:00, 5 September 2008 (UTC)[reply]
Oh I think the domain is ara-bux.com, not bara... Cirt (talk) 10:01, 5 September 2008 (UTC)[reply]
\b is a special character in regex. It's used to make sure that e.g. \bbar\.com\b matches the spam domain bar.com, but not the good domain foobar.com. --Erwin(85) 10:12, 5 September 2008 (UTC)[reply]
Ah okay thank you. Cirt (talk) 10:27, 5 September 2008 (UTC)[reply]
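(For illustration, a minimal PHP check with hypothetical domains:)

// \b marks a word boundary, so the entry matches bar.com but not foobar.com
var_dump(preg_match('/\bbar\.com\b/', 'http://bar.com/'));    // int(1) - blocked
var_dump(preg_match('/\bbar\.com\b/', 'http://foobar.com/')); // int(0) - not blocked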

Proposed additions (Bot reported)

This section is for websites which have been added to multiple wikis as observed by a bot.

Items there will automatically be archived by the bot when they get stale.

Sysops, please change the LinkStatus template to closed ({{LinkStatus|closed}}) when the report is dealt with, and change to ignore for good links ({{LinkStatus|ignore}}). More information can be found at User:SpamReportBot/cw/about

These are automated reports; please check the records and the links thoroughly, as they may be good links! For some more info, see Spam blacklist/help#SpamReportBot_reports

If the report contains links to fewer than 5 wikis, then only add it when it is really spam. Otherwise just revert the link additions and close the report; closed reports will be reopened if spamming continues.

The bot will automagically mark as stale any reports that have fewer than 5 links reported, which have not been edited in the last 7 days, and where the last editor is COIBot. They can be found in this category.

Please place suggestions on the automated reports in the discussion section.

COIBot

Running; will report a domain shortly after a link is used more than 2 times by one user on more than 2 wikis (technically: when more than 66% of the link's additions were made by this user, and more than 66% were added cross-wiki). Same system as SpamReportBot (discussions after the remark "<!-- Please put comments after this remark -->" at the bottom; please close reports when reverted/blacklisted/waiting for more, or mark them ignore when the link is good)
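(For illustration, a sketch of the stated thresholds in PHP - hypothetical counts, not COIBot's actual code, which runs in Perl:)

// hypothetical check mirroring the reporting rule described above
function shouldReport($addsByUser, $crossWikiAdds, $totalAdds, $wikisByUser) {
  return $addsByUser > 2                        // used more than 2 times by one user
      && $wikisByUser > 2                       // on more than 2 wikis
      && $addsByUser / $totalAdds > 2 / 3       // >66% of additions by this user
      && $crossWikiAdds / $totalAdds > 2 / 3;   // >66% of additions were cross-wiki
}
var_dump(shouldReport(5, 5, 6, 3)); // bool(true) - this domain would be reported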


Proposed removals

This section is for proposing that a website be unlisted; please add new entries at the bottom of the section.

Remember to provide the specific domain blacklisted, links to the articles they are used in or useful to, and arguments in favour of unlisting. Completed requests will be marked as {{removed}} or {{declined}} and archived.

See also /recurring requests for repeatedly proposed (and refused) removals.

The addition or removal of a domain from the blacklist is not a vote; please do not bold the first words in statements.


ezinearticles.com

I was about to use http://ezinearticles.com/?MMORPG-Crafting-Skills&id=1383381 as a reference for an article, but it's blacklisted - is there any special reason? --62.99.197.106 21:22, 28 August 2008 (UTC)[reply]

The reason is here, though I couldn't find the conclusion of that discussion quickly (and the log entry doesn't specify an oldid :\ not sure how that happened).  — Mike.lifeguard | @en.wb 02:24, 29 August 2008 (UTC)[reply]
OK, the full discussion is archived. Given the self-published nature of that domain, and the issues with POV-pushing over a long period of time, I am happy to have this remain on the global blacklist rather than enwiki's local list. You may choose to request whitelisting for a specific use at w:MediaWiki talk:Spam-whitelist.  Declined based on the original report.  — Mike.lifeguard | talk 23:15, 6 September 2008 (UTC)[reply]
For the record, this was cross-wiki spammed. For example (this is just a small sample):
Here are some prior discussions:
--A. B. (talk) 04:11, 7 September 2008 (UTC)[reply]

x.y.z.info

Concerning regexp [0-9]+\.[-\w\d]+\.info/?[-\w\d]+[0-9]+[-\w\d]*\].
A few days ago I removed this entry, but was told afterwards that every removal needs a de-listing discussion. So here I go again (see #double/wrong entries).
Short: this entry never worked and does not seem to be needed, so imho it's best to remove it.
Long: at the beginning of 2006 there was this request, which was added immediately. It was modified some time later. But no version of the entry ever matched anything, because the spam-block extension does not work on link descriptions, but only on the link itself. So there will never be a match on whitespace or square brackets.
Now there are 2 possibilities: 1. fix the regexp or 2. remove it permanently.
The original request said that the urls were something like (integer number).(letter).(name).info (which could perhaps be translated into \d+\.[a-z]\.\w+\.info). But if one looks at the present sbl, one can't see even one entry like this. So probably there's no need to block those domains any longer. The only possibility is that entries like "cinn\.info" and "ephraim\.info" are of this format but were inserted without third-level domains. However, a short look into the history of the sbl discussion does not verify that.
Altogether I suggest leaving the entry removed. -- seth 09:59, 7 September 2008 (UTC)[reply]
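(For illustration, a minimal PHP check with a hypothetical URL; the trailing \] requires a literal "]", which a bare URL never contains:)

// hypothetical demonstration - the old entry can never match a bare URL
$regex = '~[0-9]+\.[-\w\d]+\.info/?[-\w\d]+[0-9]+[-\w\d]*\]~';
var_dump(preg_match($regex, 'http://123.example.info/page42')); // int(0) - no "]" in a bare URL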


Caracal pistol, italian Wiki article

[QUOTE]

Hello Erwin, I would like to know the reason of Caracal info european site being listed on spamlist/blacklist and the removal of the link of the italian article. I am the author of all Wikipedia articles in 16 languages related to the first pistol made in United Arab Emirates known as Caracal pistol and I regularly post the latest news on Caracal-pistol.info website to keep readers informed of the latest developments since day one. Sincerely Edmond HUET Quickload 09:55, 6 September 2008 (UTC)[reply]

Hi, as far as I can see there's only a small amount of information available on the web site. Most links point to your Domains for sale section. That, and adding it to multiple wikis, caused me to blacklist it. Feel free to request removal from the blacklist at Talk:Spam blacklist. --Erwin(85) 11:38, 6 September 2008 (UTC)[reply]

Hi, small amount of information? Maybe you should click on the 10 buttons on the left when you are on any one of the pages http://www.caracal-consulting.com/caracal-pistol-datas/caracal-pistol-datas.html There is no other site on the web, and all the available info related to the Caracal pistol can be found on this site.

[END QUOTE]

Hello, I request removal from blacklist, above quote explains why. Ask for more info if needed. Quickload 09:12, 7 September 2008 (UTC)[reply]

Clearly that domain is not blacklisted at meta.  — Mike.lifeguard | talk 16:23, 7 September 2008 (UTC)[reply]
caracal-pistol.info however is. Given your conflict of interest in this case, the cross-wiki additions and our norm of declining de-listing requests from site owners, this request is  Declined.  — Mike.lifeguard | talk 16:30, 7 September 2008 (UTC)[reply]


OK, given the related domains, and additions by Quickload, I think this may have turned into a request for listing. I'm normally not a fan of listing related domains, but Quickload seems to have a COI here, and is adding sites cross-wiki. Looking for input here. Related domains listed below.  — Mike.lifeguard | talk 16:35, 7 September 2008 (UTC)[reply]

Related domains

caracal-consulting.com
caracal-arms.com
caracal-arms.info
caracal-arms.us
caracal-arms.us.com
caracal-firearms.be
caracal-firearms.com
caracal-firearms.eu
caracal-firearms.fr
caracal-firearms.info
caracal-firearms.net
caracal-firearms.org
caracal-firearms.us
caracal-firearms.us.com
caracal-pistol.biz
caracal-pistol.com
caracal-pistol.eu
caracal-pistol.fr
caracal-pistol.info
caracal-pistol.net
caracal-pistol.org
caracal-pistol.us
caracal-pistol.us.com
caracal.in
caracal.us
caracalarms.com
caracalarms.eu
caracalarms.fr
caracalarms.info
caracalarms.net
caracalarms.org
caracalarms.us
caracalarms.us.com
caracalfirearms.com
caracalfirearms.net
caracalfirearms.us
caracalfirearms.us.com
caracalpistol.com
caracalpistol.info
caracalpistol.us

 — Mike.lifeguard | talk 16:35, 7 September 2008 (UTC)[reply]

Troubleshooting and problems

This section is for comments related to problems with the blacklist (such as incorrect syntax or entries not being blocked), or problems saving a page because of a blacklisted link. This is not the section to request that an entry be unlisted (see Proposed removals above).

double/wrong entries

when i deleted some entries from the german sbl, which are already listed in the meta sbl, i saw that there are many double entries in the meta sbl, e.g., search for

top-seo, buy-viagra, powerleveling, cthb, timeyiqi, cnvacation, mendean

and you'll find some of them. if you find it useful, i can try to write a small script (in august) which finds more entries of this kind.
furthermore i'm wondering about some entries:

  1. "\zoofilia", for "\z" matches the end of a string.
  2. "\.us\.ma([\/\]\b\s]|$)", for ([\/\]\b\s]|$) is the same as simply \b, isn't it? (back-refs are not of interest here)
  3. "1001nights\.net\free-porn", for \f matches a formfeed, i.e., never
  4. "\bweb\.archive\.org\[^ \]\{0,50\}", for that seems to be BRE, but php uses ERE, so i guess, this will never match
  5. "\btranslatedarticles\].com", for \] matches a ']', so will probably never match.

before i go on, i want to know, if you are interested in this information or not. :-) -- seth 22:23, 12 July 2008 (UTC)[reply]
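(For illustration, point 1 as a PHP sketch with a hypothetical URL; \z anchors at the end of the subject, so "\zoofilia" can never match the literal word:)

// hypothetical demonstration of point 1 - \z asserts end-of-string
var_dump(preg_match('/\zoofilia/', 'http://zoofilia.example/')); // int(0) - can never match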

You know, we could use someone like you to clean up the blacklist... :D Kylu 01:53, 13 July 2008 (UTC)[reply]
We are indeed interested in such issues - I will hopefully fix these ones now; keep 'em coming!  — Mike.lifeguard | @en.wb 01:59, 13 July 2008 (UTC)[reply]
Some of the dupes will be left for clarity's sake. When regexes are part of the same request they can be safely consolidated (I do this whenever I find them), but when they are not, it would be confusing to do so, in many cases. Perhaps merging regexes in a way that is sure to be clear in the future is something worth discussing, but I can think of no good way of doing so.  — Mike.lifeguard | @en.wb 02:06, 13 July 2008 (UTC)[reply]
in de-SBL we try to cope with that only in our log-file [5]. there one can find all necessary information about every white-, de-white-, black- and de-blacklisting. the sbl itself is just a regexp-speed-optimized list for the extension without any claim of being chronologically arranged.
i guess that the size of the blacklist will keep increasing in the future, so a speed optimization will perhaps become necessary. btw, has anyone ever made any benchmarks of this extension? i only know that buffering was implemented at some point.
oh, and if one wants to correct further regexps: just search by regexp (e.g. in vim) for /\\[^.b\/+?]/ manually and delete needless backslashes, e.g. \- \~ \= \:. apart from that, the brackets in single-char classes like [\w] are needless too. "\s" will never match (the extension tests only URLs, which contain no whitespace). -- seth 11:36, 13 July 2008 (UTC)[reply]
fine-tuning: [1234] is much faster in processing than (1|2|3|4); and (?:foo|bar|baz) is faster than (foo|bar|baz). -- seth 18:21, 13 July 2008 (UTC)[reply]
I benchmarked it; (a|b|c) and [abc] had different performance. Same with the latter case — VasilievV 2 21:02, 14 July 2008 (UTC)[reply]
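(A minimal timing harness along these lines - a sketch, not the original benchmark:)

// hypothetical micro-benchmark: alternation vs. character class
$subject = str_repeat('http://www.example.org/page ', 100);
foreach (array('/(1|2|3|4)/', '/[1234]/') as $regex) {
  $start = microtime(true);
  for ($i = 0; $i < 100000; $i++) {
    preg_match($regex, $subject);
  }
  printf("%-12s %.3fs\n", $regex, microtime(true) - $start);
}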
So should we be making those changes? (ie was it of net benefit to performance?)  — Mike.lifeguard | @en.wb 21:56, 15 July 2008 (UTC)[reply]
these differences result from the regexp implementation. but what i meant by benchmarking is the following: how much does the length of the blacklist cost (measured in time)? i don't know how fast the wp servers are. however, i benchmarked it now on my present but old computer (about 300-500 MHz):
if i have one simple url like http://www.example.org/ and let all ~6400 entries of the present meta blacklist match against this url, it takes about 0.15 seconds until all regexps are done. and i really measured only the pure matching:
// reduced part of SpamBlacklist_body.php
$retVal = 0;
foreach ($blacklists as $regex) {
  // test the concatenated external links against one blacklist regex
  $check = preg_match($regex, $links, $matches);
  if ($check) {
    $retVal = 1; // a blacklisted link was found - no need to check further
    break;
  }
}
so i suppose that it would not be a bad idea to care about speed, i.e. replace unnecessary patterns by faster patterns and remove double entries. ;-)
if you want me to, i can help with that, but soonest in august.
well, the replacement is done quickly, if one of you uses vim
the replacement of (.|...) by [...] can be done manually, because there are just 6 occurrences. the replacement of (...) by (?:...) can be done afterwards by
:%s/^\([^#]*\)\(\\\)\@<!(\(?\)\@!/\1(?:/gc
-- seth 23:26, 15 July 2008 (UTC)[reply]
some explicit further bugs:
\mysergeybrin\.com -> \m does not exist
\hd-dvd-key\.com -> \h does not exist
however, because nobody answered (or read?) my last comment... would it be useful to give me temporary rights to do the modifications myself? -- seth 01:44, 7 August 2008 (UTC)[reply]
I fixed these. You can always request (temporary) sysop status. Any help is appreciated. --Erwin(85) 12:45, 7 August 2008 (UTC)[reply]
requested and got it. :-) -- seth 09:18, 13 August 2008 (UTC)[reply]

before i start modifying the list, i want to know whether i should log my changes somewhere. oh, and btw, i suppose that the entry [0-9]+\.[-\w\d]+\.info\/?[-\w\d]+[0-9]+[-\w\d]*\] is somehow senseless, for it will probably never match. i found the original discussion [6] (the regexp was changed afterwards), but the regexp will not grep the links mentioned there. shall i just delete such an entry or shall i make a new request and try to correct it? -- seth 09:18, 13 August 2008 (UTC)[reply]

It would be nice if you could update the log as well, so we can still find the corresponding log message. Though maybe we should wait and see if anything new comes out of #The Logs. I guess it's best to correct wrong entries or in any case log all those removals. It probably wouldn't hurt if some were removed, but I have no idea how many entries we're talking about. --Erwin(85) 09:31, 13 August 2008 (UTC)[reply]
ok, so i'll wait until the other thread is finished. but i don't think that manipulating the logs is a good idea, because it will make tracing of entry changes difficult.
i guess, there are less than 10, perhaps even less than 5 useless entries. -- seth 10:29, 13 August 2008 (UTC)[reply]
i cleaned up the sbl two days ago. until now i did not delete any entries (except for grouping purposes). and i could not correct the entry "\bnstpi\.com\.my/ client" (with a senseless space) because its diff wasn't very meaningful. perhaps somebody knows something about this entry and could tell it.
however, one question is: shall i really modify the wrong entries in the logs, too? it is like changing history, so it could cause irritations. -- seth 08:48, 26 August 2008 (UTC)[reply]
http://meta.wikimedia.org/w/index.php?title=Spam_blacklist&diff=1147562&oldid=1146869, which added the question marks, also blocked legitimate sites. For example, chabad(east|usa|world)\.(am|com|org) and chabad\.am became chabad(?:east|usa|world)?\.(?:am|com|org), which blocked legitimate sites such as chabad.com and chabad.org. A solution may be to remove the question marks for this entry and restore it to 2 entries like it was before. --PinchasC 14:20, 1 September 2008 (UTC)[reply]
Done - Regex is now chabad(?:east|usa|world)\.(?:am|com|org), and should block what it's supposed to now.  — Mike.lifeguard | @en.wb 15:39, 1 September 2008 (UTC)[reply]
oops, sorry for my mistake. PinchasC is right. in addition to Mike.lifeguard's correction i will re-insert the explicit entry chabad\.am. -- seth 08:26, 2 September 2008 (UTC)[reply]
"\bnstpi\.com\.my/ client": after looking at the request on the TP and the links mentioned there, i suppose that the trailing " client" can just be ignored, so i deleted it. otherwise the regexp would be totally useless. -- seth 10:13, 7 September 2008 (UTC)[reply]

what does "let's not use ?: - it makes COIBot unhappy[...]"[7] mean precisely? -- seth 23:55, 27 August 2008 (UTC)[reply]

Beetstra can tell you exactly, as he is the bot's owner. I believe it choked on that as it isn't handled properly in Perl. Also, some of the very long regexes caused issues (but I didn't change those). I am having second thoughts about consolidating regexes which are not part of the same request. Regexes added together can be mushed together easily, but those in separate requests should likely stay separate, I think. Not sure what to do next about this though.  — Mike.lifeguard | @en.wb 23:59, 27 August 2008 (UTC)[reply]
COIBot: well, perl could cope with non-capturing patterns /(?:foo)/ long before php even existed, so i guess it isn't really a perl problem. i'll ask Beetstra on his talk page about that.
grouping: as far as i can see, the sbl page can be used for blocking only. all relevant blocking information is listed in the log (and the links mentioned there). so i don't see how even a random sort of the sbl entries combined with randomly grouped regexps could do harm. -- seth 01:49, 28 August 2008 (UTC)[reply]

double entries

I wrote a small script to grep most of the double (or multiple) entries. The result is presented at User:Lustiger_seth/sbl_double_entries. As you can see, there are many (>250) redundant entries. I guess we could delete more than 200 of them. -- seth 22:59, 19 August 2008 (UTC)[reply]

moved a discussion to previous thread. -- seth 08:26, 2 September 2008 (UTC)[reply]
So, as we now log removals too, I will delete double entries, if nobody raises objections. -- seth 19:17, 3 September 2008 (UTC)[reply]

done. some additional comments on deleted entries, which were not exactly double:

\.rr\.nu             # deleted, although it is not fully superseded by \brr\.nu\b, but almost. i guess that the domain .nu was meant, so the postfix "\b" is ok.
caiquecrazy\.us\.tt  # almost fully superseded by \bu[ks]\.tt\b
\.6url\.com          # almost fully superseded by \b6url\.com\b
\.flingk\.com        # almost fully superseded by \bflingk\.com\b
\.metamark\.net      # almost fully superseded by \bmetamark\.net\b
\.paulding\.net      # almost fully superseded by \bpaulding\.net\b
\.shorl\.com         # almost fully superseded by \bshorl\.com\b
\.shortlinks\.co\.uk # almost fully superseded by \bshortlinks\.co\.uk\b
\.simurl\.com        # almost fully superseded by \bsimurl\.com\b
\.smcurl\.com        # almost fully superseded by \bsmcurl\.com\b
\.tighturl\.com      # almost fully superseded by \btighturl\.com\b
\.yatuc\.com         # almost fully superseded by \byatuc\.com\b
\.yep\.it            # almost fully superseded by \byep\.it\b
\.ontheweb\.nu       # almost fully superseded by \bontheweb\.nu\b
\.isgre\.at          # almost fully superseded by \bisgre\.at\b
drugs\.isgre\.at     # same as above
\.byinter\.net       # almost fully superseded by \bbyinter\.net\b
drugs\.byinter\.net  # same as above
nigeria\.tz4\.com    # almost fully superseded by \btz4\.com\b
\binternet-history\.tz4\.com # same as above
\.edom\.co\.uk       # almost fully superseded by \bedom\.co\.uk\b
\.fw\.nu             # almost fully superseded by \bfw\.nu\b
\.redirect\.hm       # almost fully superseded by \bredirect\.hm\b
drugs\.passingg\.as  # almost fully superseded by \bpassingg\.as\b
\.shop\.tc           # almost fully superseded by \b(?:au|es|hk|hu|ie|it|kr|mx|pl|se|th|ua|us|shop)\.tc\b
\.explode\.to        # almost fully superseded by \bexplode\.to\b
\.zwap\.to           # almost fully superseded by \bzwap\.to\b
squidoo\.com/inexpensive-wine  # almost fully superseded by \bsquidoo\.com\b
squidoo\.com/localphoneservice # same as above
\bsearchtravel\.biz/countrylist/italy.php # almost the same as \bsearchtravel\.biz/countrylist/italy\.php\b

-- seth 12:13, 5 September 2008 (UTC)[reply]

nimp.org

Just noticed that we are blocking only

wikipedia\.on\.nimp\.org
\.on\.nimp\.org
\bblocked\.on\.nimp\.org\b

when we might as well block the whole thing:

\bnimp\.org\b

 — Mike.lifeguard | @en.wb 01:44, 3 September 2008 (UTC)[reply]

User: namespace abuse

Nervenhammer

similar pattern, adding a personal link..--Cometstyles 12:01, 12 August 2008 (UTC)[reply]
Thanks Comets - Added Added for now. In passing, I see no harm in listing such sites, as much to send a message to the user that their behaviour may not be appropriate. Not sure about how lasting the listing should be or whether we should log immediately - thoughts welcome. --Herby talk thyme 12:12, 12 August 2008 (UTC)[reply]
Reviewing this it may well be a good-faith de user who has just decided to expand their interests (based on SUL info). In which case I suggest serious consideration for de-listing if we are asked. --Herby talk thyme 12:17, 12 August 2008 (UTC)[reply]
Hi guys, I don't understand this, why is my personal website Nervenhammer on this blacklist? Fleshgrinder 09:53, 22 August 2008 (UTC)[reply]
Adding the link to your userpage on many wikis where you are not a community member is generally frowned upon. I suggest you instead leave a link to your userpage on your home wiki if you need to create a userpage. If you are an established community member, you would be afforded more leeway with respect to user page content. I'm prepared to de-list this on the condition that the link is not added cross-wiki again.  — Mike.lifeguard | @en.wb 14:20, 31 August 2008 (UTC)[reply]
Okay, I'm very sorry about that; it was never my intention to start link building for my website - I only wanted to show who I am and what I do. It won't happen again. If I'm not really contributing something I won't create a userpage, and if I do, I'll set a link to the German Wikipedia (where I contribute the most). Thank you for the answer and for enlightening me about this issue. It would be nice if you would de-list the URI, because I don't want my URI to be on a blacklist and I'm definitely not going to post the address again. Kindest regards --Fleshgrinder 09:48, 2 September 2008 (UTC)[reply]

Removed Removed  — Mike.lifeguard | @en.wb 14:51, 2 September 2008 (UTC)[reply]

Autofinance

Cross wiki spam pages. (autofinance-ez.com is the domain). --Herby talk thyme 12:59, 13 August 2008 (UTC)[reply]

Added Added  — Mike.lifeguard | @en.wb 17:18, 1 September 2008 (UTC)[reply]

Bestlyriccollection

What is that? I stumbled into it when 84.109.83.73 was vandalizing through the wikis. Best regards, --birdy geimfyglið (:> )=| 10:41, 14 August 2008 (UTC)[reply]

Very odd indeed. fr wp didn't like the idea of a "user page for bookmarks". Not sure that it is spam but sure doesn't look like "normal" user pages. Looking some more & other opinions would be good. --Herby talk thyme 11:02, 14 August 2008 (UTC)[reply]
They have a point [8]... I don't understand why he needs that on multiple wikis. I mean, if he (mis)uses his userpage for bookmarks, why in so many places? --birdy geimfyglið (:> )=| 12:25, 14 August 2008 (UTC)[reply]

Jon Awbrey‎ and JonAwbrey‎

Creates userpages full of external links (and self-promotion references?) on many wikis. Annabel 19:08, 28 August 2008 (UTC)[reply]

I placed the same vita on my user page that I use on all the sites where I contribute work and discuss ideas with other interested parties. This does not constitute SPAM (= "unsolicited mass-mailing or posting") in any technical or COI sense of the word. I would appreciate the two variants of my real name that I use on the Internet and Web not being listed on any kind of badlists. Thank you, Jon Awbrey 19:12, 29 August 2008 (UTC)[reply]
While it may not be spam, it would seem to be abuse of WMF wikis & as such unwanted. While community members are given leeway with their userpages, such excessive linking is generally frowned upon. Furthermore, I very much doubt you understand all the languages you have posted this to, nor are you active in those wikis. I invite you to fix the problem before it is done for you. The history at enwiki will be of interest to others reviewing this.  — Mike.lifeguard | @en.wb 19:36, 29 August 2008 (UTC)[reply]
I would appreciate it if you could point to the relevant WMF Terms of Service, or even a generally accepted standard of etiquette, that would justify your calling this user page vita an "Abuse". I am referring to the one now posted here at Meta, which is a copy of the one deleted by Annabel from my Nederlands User Page. By "generally accepted standard of etiquette" I mean one that you could honestly assure me is followed across the board on all WMF User Pages. In addition, I have never seen any notice of Wikipedias being "Encyclopedias that anyone who is fluent in the local language can edit" — but please let me know if I have missed such a restriction somewhere. Jon Awbrey 20:22, 29 August 2008 (UTC)
You misunderstand me crucially. I do not say you need to be fluent in the languages where you contribute. To claim that would be hypocritical; I edit all WMF wikis. The issue is that:
  1. You are not an established member of the community on any wiki where you have a userpage (so far as I can tell).
  2. Your userpage has an excessive amount of links (indeed, links form the only content, and they appear to be placed for self-promotional purposes). This would perhaps be an issue regardless of the above.
 — Mike.lifeguard | @en.wb 20:31, 29 August 2008 (UTC)[reply]

[Undent]: Correct me if I am wrong, but I do not think it is customary for newcomers to any of the many-tongued Wikipædiæ to be subjected to the ordeals of this type of entrance exam with regard to the legitimacy of their participation. However, by FYIing my real name, educational background, and ongoing intellectual interests, I have certainly done more than the average Anon IP on that score.

Many people post pics on their user pages as a way of providing a friendly introduction to themselves, their current interests, and their personal histories. My old web vita harks back to a day when I was unsure about the propriety of copying pics, so I used links instead, over the years being forced to replace many of them with WayBack links. You can hardly dream that I am collecting revenue off archival links like that, can you?

If and when you personally discover an interest in some of the Active Suggestions Concerning Intellectual Interchange that I enumerated in my web vita — which was my sole purpose in posting it to my NL User Page — then we may find more interesting things to talk about. In the mean time, I can hardly become an "established member of the community on any wiki", much less learn a few bits of the local colour and language, if some Admin deletes my self-introductory user page and blocks my account after the first few edits, now can I? Jon Awbrey 23:45, 29 August 2008 (UTC)[reply]

  • Jon, this same sort of Wikilawyering nonsense is what got you banned from enWP and booted from the mailing list. Obviously your rampant sockpuppetry and disruption ensures you remain banned on enWP. I would be the first to help you if you wanted your massive list of socks associated with some other name, to reduce the impact on you, but I don't see why we should help you to pretend that you are here to do anything other than the usual: self-promotion and idiosyncratic original research. JzG 20:50, 4 September 2008 (UTC)[reply]
Still placing pages - en wq in the past few hours. Cheers --Herby talk thyme 08:00, 6 September 2008 (UTC)[reply]
This is shameless self-promotion, and I would suggest that someone who has the necessary rights removes the pages from all projects on which he is not an active participant. JzG 11:44, 7 September 2008 (UTC)[reply]

So, the following links are the ones being used for vanity spamming here:

planetmath.org/encyclopedia/DifferentialPropositionalCalculus.html
www.mywikibiz.com/Directory:Jon_Awbrey
www.mywikibiz.com/User:Jon_Awbrey
www.mywikibiz.com/User_talk:Jon_Awbrey 
http://planetmath.org/?op=getuser&id=15246
knol.google.com/k/-/-/3fkwvf69kridz/1
mathforum.org/kb/accountView.jspa?userID=99854
www.mathweb.org/wiki/User:Jon_Awbrey
www.mathweb.org/wiki/User_talk:Jon_Awbrey
www.research.att.com/~njas/sequences/?q=Awbrey
www.p2pfoundation.net/User:JonAwbrey
www.p2pfoundation.net/User_talk:JonAwbrey
altheim.4java.ca/ceryle/wiki/Wiki.jsp?page=JonAwbrey
forum.wolframscience.com/member.php?s=&action=getinfo&userid=336
www.wikinfo.org/index.php/User:Jon_Awbrey
www.wikinfo.org/index.php/User_talk:Jon_Awbrey
www.getwiki.net/-User:Jon_Awbrey
www.getwiki.net/-UserTalk:Jon_Awbrey
ontolog.cim3.net/cgi-bin/wiki.pl?JonAwbrey
semanticweb.org/wiki/User:Jon_Awbrey
semanticweb.org/wiki/User_talk:Jon_Awbrey
wikipediareview.com/index.php?showuser=5619
wikipediareview.com/index.php?showuser=398
zh.wikipedia.org/wiki/User:Jon_Awbrey
zh.wikipedia.org/wiki/User_talk:Jon_Awbrey
org.sagepub.com/cgi/content/abstract/8/2/269
www.cspeirce.com/menu/library/aboutcsp/awbrey/integrat.htm
www.chss.montclair.edu/inquiry/fall95/awbrey.html
www.abccommunity.org/tmp-a.html
www2.oakland.edu/secs/dispprofile.asp?Fname=Fatma&Lname=Mili
www2.oakland.edu/secs/dispprofile.asp?Fname=Mohamed&Lname=Zohdy
www2.oakland.edu/oakland/ouportal/index.asp?site=87
www.msu.edu/dig/msumap/psychology.html
www.msu.edu/dig/msumap/beaumont.html
quod.lib.umich.edu/cgi/i/image/image-idx?id=S-BHL-X-BL001808%5DBL001808
www.uiuc.edu/navigation/buildings/altgeld.top.html 
www.mth.msu.edu/images/wells_medium.jpg 
www.msu.edu/dig/msumap/phillips.html
www.enolagaia.com/JMC.html

Discussion

NOINDEX

Prior discussion at Talk:Spam_blacklist/Archives/2008/06#Excluding_our_work_from_search_engines, among other places

There is now a magic word __NOINDEX__ which we can use to selectively exclude certain pages from being indexed. I suggest having the bots use this magic word in all reports generated immediately. Whether to have this page and its archives indexed was a point of contention previously, and deserves further discussion.  — Mike.lifeguard | @en.wb 01:33, 4 August 2008 (UTC)[reply]

Sorry missed this one. I certainly support the "noindex" of the bot pages. They are somewhat speculative. If we could get the page name changed I would be happier about not using the magic word on this but..... --Herby talk thyme 16:09, 6 August 2008 (UTC)[reply]
I have added the keyword to the COIBot generated reports, they should now follow that. --Dirk Beetstra T C (en: U, T) 16:31, 6 August 2008 (UTC)[reply]
My bot is flagged now, so I can start adding it to old reports. I will poke a sysadmin first to see if I really must make ~12000 edits before I start though. It will not be all in one go, and I will not start for a day or two.
Any other thoughts on adding it to this page and/or its archives?  — Mike.lifeguard | @en.wb 18:11, 9 August 2008 (UTC)[reply]
Already sort-of done with {{linkstatus}}, so the bot probably won't run. I plan to keep the flag though <evil grin>  — Mike.lifeguard | @en.wb 22:56, 11 August 2008 (UTC)[reply]

Renaming the blacklist should be done at some point in the future; we'll have to wait on Brion for that. Until then, I'd like to have this page and its archives __NOINDEX__ed. Having it indexed causes more issues than it solves & we now have an easy way to remedy the situation. We should review this when the blacklist is renamed.  — Mike.lifeguard | @en.wb 02:55, 14 August 2008 (UTC)[reply]

The Logs

log system

I would like to consolidate our logs into one system which uses subpages and transclusions to make things easy. Each month would get a subpage, which is then transcluded onto Spam blacklist/Log so they can easily be searched. This would mean merging Nakon's "log entries" into the main log, and including the pre-2008 log. This wouldn't require much change in how we log things.

However, I wonder what people think about also logging removals and/or changes to the regexes. Currently, we don't keep track of those in any systematic way, but I think we should. For example, I consolidated a few regexes a while back, and simply made the old log entries match the new regexes, which is rather Orwellian. Similarly, we simply remove log entries when we remove domains - nothing is added to the log, so we cannot track this easily. This idea (changing the way we log things) is likely going to require some discussion; I don't think there should be any problem moving to transcluded subpages immediately.

 — Mike.lifeguard | @en.wb 14:41, 6 August 2008 (UTC)[reply]

I'm all for using one system for the logs. I'm not sure about your second idea though. Is the log intended purely to explain the current entries or also former entries and perhaps even edits? Logging removals would be a good idea to see if a domain was once listed, but logging changes seems too bureaucratic. Matching the log entries with the new regexes might be Orwellian, but it's also pragmatic. What are the advantages of logging changes? Could you perhaps give an example of how you suggest to log changes? --Erwin(85) 18:16, 6 August 2008 (UTC)[reply]
I should say I mean "Orwellian" without the connotative value. The denotative value is simply that the current method is "changing history" - not in and of itself a bad thing. Indeed, I've had no issues with this, hence the speculative nature of that part of my suggestion.  — Mike.lifeguard | @en.wb 19:48, 6 August 2008 (UTC)[reply]
in de:WP:SBL we do log all new entries, removals and changes on black- and whitelists. logging changes can be useful e.g. for retracing old discussions. -- seth 01:35, 7 August 2008 (UTC)[reply]
i think, that the transclusions are a good idea to keep the traffic low. is anybody against that?
concerning the logging of removals/modifications: what do you think about a log system like de:Wikipedia:Spam-blacklist/log#Mai_2008? -- seth 12:12, 13 August 2008 (UTC)[reply]
It would be quite some work to link the diffs, but I'm not against using it. I guess that means this is a weak support. --Erwin(85) 09:35, 19 August 2008 (UTC)[reply]

if everyone else continues ignoring the suggestions till tomorrow, i will start realizing Mike.lifeguard's idea by creating subpages like

apart from that i'd like to know...

  1. which components/tools are dependent on the sbl-log-syntax/-format?
  2. am i right, that there is no meta-whitelist? will there ever be one?
  3. would it be ok to switch from the old log syntax to a new one, without converting the old log entries?

-- seth 10:23, 23 August 2008 (UTC)[reply]

Please use subpages; I changed your examples above. There is no global whitelist, no. But in the future? Perhaps something to request. I imagine leaving old logs will be fine. Are we sure we want to log changes to regexes? I'm not sure whether that's really necessary. It also raises the already-high bar to contributing in this area. Our procedures are opaque enough as it is - this is one more hoop we are making potential recruits to the anti-spam team jump through.  — Mike.lifeguard | @en.wb 22:12, 23 August 2008 (UTC)[reply]
whitelist: i guess a global whitelist would not be necessary, because blacklist entries usually can be modified with plain regexp syntax to match all of example.org except example.org/good. such a blacklist entry would be
example\.org(?!/good)
however, there may be cases where an explicit whitelist entry would be more human-readable.
leaving old logs: if removals shall be logged, how shall they be logged? just by comment?
log changes: the main reasons why i am asking are #double/wrong entries and #double entries. if it was ok to remove bugs, syntax optimizations and double entries without logging them, it would be less work for me. ;-) -- seth 22:50, 23 August 2008 (UTC)[reply]
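(For illustration, a minimal PHP check of the lookahead idea with hypothetical URLs:)

// the negative lookahead blocks the domain while exempting one "good" path
$entry = '~example\.org(?!/good)~';
var_dump(preg_match($entry, 'http://example.org/bad'));  // int(1) - blocked
var_dump(preg_match($entry, 'http://example.org/good')); // int(0) - allowed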
I guess so, but if you see something suspicious please check if it can really be removed. Using the new syntax is OK with me. Logging removals like on dewiki looks good. --Erwin(85) 10:19, 24 August 2008 (UTC)[reply]

at least the splitting is done. [9] -- seth 09:37, 25 August 2008 (UTC)[reply]

Thanks for taking care of the logs; I think that will work much better.
I'm not sure whether I'm happy with having regexes consolidated as you've done. Within each set of additions, one should try to be concise with your regexes, but I don't think merging all the blogspot ones together is necessarily a good idea. This will make future removals more difficult. In case you forget, not all are as proficient with regex as you, myself included!  — Mike.lifeguard | @en.wb 02:14, 29 August 2008 (UTC)[reply]
first of all: i did not merge blogspot entries. the very long blogspot line had existed before my "big" edit. ;-)
merging all blogspot links into one line would probably not be a good idea, because of performance reasons (the extension builds the regexps in 4k blocks) and because of COIBot, which now allows a maximum line length of 1k chars.
(not to be misunderstood: grouping regexps increases performance, but lines >1k will lead to problems)
i grouped only a few regexps, and only if they were "near" each other in the SBL and had no different headings. as regexp grouping was used already, i didn't think it would be difficult to read. the largest grouping i did at the beginning and in lines 3300-3500, see [10]. was that too much?
concerning the logging: afaics we all want to log removals, too, don't we? but if i didn't get you wrong, you don't want to change the log-syntax. so i don't understand how you want SBL removals to be logged? :-) -- seth 09:43, 29 August 2008 (UTC)[reply]
My mistake on the blogspot one then. I've said nothing about not changing the log format - feel free to do so in order to log both additions and removals - the template you would want to change is {{sbl-log}} and the "snippet" at the top of this page.  — Mike.lifeguard | @en.wb 18:01, 29 August 2008 (UTC)[reply]
oh, ok. i misunderstood "I imagine leaving old logs will be fine." -- seth 09:50, 31 August 2008 (UTC)[reply]
Afaics sbl-log does not need to be changed. To keep the log syntax somehow downwards compatible, it will suffice to change the syntax like this:
example\.org # name # b+ reason
where "b+" means addition on blacklist, "b-" means removal.
To keep the format more compact we could use the dewiki-style
example\.org # [SBL-diff b+] # reason
which results in something like
example\.org # b+ # reason
But this is a bit more work for the admins and gives only a small piece of additional information (the exact date of addition/removal), so I don't know whether this is really better. Although Erwin said the dewiki syntax looked good and Mike.lifeguard told me to feel free, I'm not sure if any other admin will beat me if I change the syntax to dewiki style. :-) -- seth 11:21, 31 August 2008 (UTC)[reply]
However, i've been bold. Now we have same syntax as dewiki. -- seth 15:10, 2 September 2008 (UTC)[reply]
I adapted COIBot in the XWiki reports, it now (should) say(s) (have to wait for the next report from nowdiff):
 \bexample\.org             # [SBL-diff b+] # see [[User:COIBot/XWiki/example.org]]
with (hopefully) the first # at position 40 (may have miscalculated that). Replace SBL-diff with the diff and save. It is going to be more work, but well, it is also clearer from now on what happens. --Dirk Beetstra T C (en: U, T) 15:33, 2 September 2008 (UTC)[reply]
Can we please keep the admin's name (and the span which was there previously)? Furthermore, when someone is using the log snippet at the top of this page, it will follow the old format.  — Mike.lifeguard | @en.wb 16:59, 2 September 2008 (UTC)[reply]
I've added a snippet for logging on the actual blacklist. Take the snippet after you make an edit.
So, to log an addition, grab the snippet from this page and the snippet from the blacklist page.
For additions, use {{sbl-log|1161258#{{subst:anchorencode:Example}}}} {{sbl-diff|1161261}}
which produces request addition
For removals, use {{sbl-log|1161258#{{subst:anchorencode:Example}}}} {{sbl-diff|1161261|removal}}
which produces request removal
This should make it faster to log things, I think.  — Mike.lifeguard | @en.wb 17:27, 2 September 2008 (UTC)[reply]
OK, changed it back .. we are not sure about this implementation yet (for me, it does give extra work, and IMHO does not add much; a simple '+' or '-' in the logs without the actual difflink should suffice .. ). --Dirk Beetstra T C (en: U, T) 17:34, 2 September 2008 (UTC) (forgot to sign)[reply]
Actually the admin's name is redundant, because it is included in the diff. If all (even redundant) information is provided (like now), it makes a lot of work for the admins.
The additional information provided by the difflink is quite small (i.e. the exact modification date). A simple '+'/'-' (or 'b+'/'b-') would be enough. (That's why I was asking a few lines above.) The difflink would be fully superfluous if all admins used the edit summary line of the sbl to describe the added/removed entry explicitly, but that is unrealistic, I know.
So which syntax shall be used? Afaics its main features must be: 1. provide the important information, 2. be easy to input for admins, and 3. be not too hard for machines to read. I guess all the above suggestions will do, so it doesn't really make a big difference which one is chosen.
I guess, if nobody answers, we will just continue like now. -- seth 08:22, 3 September 2008 (UTC)[reply]
The admin's name isn't redundant - that is information we will want without having to look at the diff. By that logic, we would also not record whether it was an addition or removal, since that is information contained in the diff 0.o  — Mike.lifeguard | @en.wb 14:20, 3 September 2008 (UTC)[reply]
The '+'/'-' is redundant, right. But it is a main information about the sbl modification. The admin's name is imho not so important. But we don't need to discuss about that small point. For me the current syntax is no problem. :-) -- seth 19:12, 3 September 2008 (UTC)[reply]

tool for log searching

The simplest way to improve searchability is to write a tool that searches the logs for you. I'm in the middle of doing so, and I'll have a working prototype in a few days. The way this would work is it would load all the pages (it really does not matter where the pages are) and apply a few regexes to them. This means we really don't have to merge Nakon's stuff; I can just add that page to the tool. As long as the logs keep the same pattern of one entry per line, a tool is not difficult.

I don't really think logging removals is smart; we never remove entries from the logs anyway. The simplest way is to keep the logs write-only (only new entries) and have a tool list all matches. (I'm writing the tool in a manner where you will be able to put the domain in "plain", as in google.com, and it will find all the relevant entries, even if it has \bgoogle\.com\b, or some other weirdness.) —— nixeagle 20:23, 6 August 2008 (UTC)[reply]
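(For illustration, a sketch of that lookup idea in PHP - the file name and log format here are assumptions, not the actual tool:)

// hypothetical sketch: find log lines whose regex entry would match a plain domain
$domain = 'google.com';
$lines = file('sbl_log.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line) {
  // assume each log line starts with one regex entry, e.g. "\bgoogle\.com\b # ..."
  $parts = explode('#', $line, 2);
  $entry = trim($parts[0]);
  if ($entry !== '' && @preg_match('~' . $entry . '~i', "http://$domain/") === 1) {
    echo $line, "\n"; // this entry would match the domain
  }
}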

lol, by accident i started writing a similar tool 2 hours ago. but i'm writing a cli perl script only. so far it greps all sbl entries (in the meta blacklist, de blacklist and de whitelist) that would match a given url. -- seth 01:35, 7 August 2008 (UTC)[reply]
Seth, nixeagle: actually, having a tool that searches all blacklists and logs (i.e. cross-wiki) to see if it is blacklisted somewhere, and if there is a log for that, would be great. IMHO, it should be 'easy' to write a tool that extracts all regexes from the page and tests whether one matches a certain url that we search (and it could then be incorporated into the {{linksummary}} to easily find it ..). Or is this just what you guys are working on ;-) .. --Dirk Beetstra T C (en: U, T) 09:50, 7 August 2008 (UTC)[reply]
beta version. :-) -- seth 14:38, 7 August 2008 (UTC)[reply]
WONDERFUL!
one question: can you make it add 'http://' by itself (as we only put the domain in the linksummary, so as to prevent the blacklist from blocking it ..). --Dirk Beetstra T C (en: U, T) 14:48, 7 August 2008 (UTC)[reply]
That's about what I was writing. I was putting it in the framework of http://toolserver.org/~eagle/spamArchiveSearch.php where the tool retrieves the section/page and links you directly to where the item was mentioned. For logs I was working on displaying the line entry in the log as one of the results, so you would not even have to view the log page. —— nixeagle 15:11, 7 August 2008 (UTC)[reply]
if you want to combine my script with that framework, i can give you the source code. but it is perl-code and it is ugly, about 110 lines. -- seth 17:00, 7 August 2008 (UTC)[reply]
some pages

Suggestion for pages:

Thanks! --Dirk Beetstra T C (en: U, T) 14:48, 7 August 2008 (UTC)[reply]

i had to cope with a bug in the en sbl, but now it seems to work. further suggestions? (the more lists i include, the slower the script will get.) -- seth 16:44, 7 August 2008 (UTC)[reply]
I would suggest doing it progressively: first the meta and en blacklists, the rest later (roughly in order of wiki size), similar to what luxo does. --Dirk Beetstra T C (en: U, T) 17:07, 7 August 2008 (UTC)[reply]
i used a hash, and those don't care about the order of declaration. now it should be sorted. -- seth 22:11, 7 August 2008 (UTC)[reply]

User page advertising

Another "thinking aloud" one!

I guess I come across a commercially orientated user page on Commons once a day on average. The past week has brought a "Buying cars" page, an "Insurance sales" page, and a "Pool supplies" page, as well as blog/software/marketing pages. I do usually run vvv's SUL tool, but quite often there is nothing immediately (the Pool supplies one cropped up on en wp a couple of days after Commons). I know en wp are often reluctant to delete such pages out of hand (which I find incredible).

I think what I am probably saying is: should we open up a section here to allow others to watch/comment/block/delete or whatever across wikis? --Herby talk thyme 09:51, 10 August 2008 (UTC)[reply]

I agree, this is a great idea, as I have also noticed spammers like this go cross-wiki to multiple projects (Wikinews/Commons, etc.) Cirt 11:40, 10 August 2008 (UTC)[reply]
Agree. Others may be interested in watching only that part of our work - perhaps a transcluded subpage so it may be watched separately?  — Mike.lifeguard | @en.wb 14:03, 10 August 2008 (UTC)[reply]
Sounds like the best way to proceed. Cirt 14:09, 10 August 2008 (UTC)[reply]
Thanks so far - good to get other views as well but as an idea of the scale I picked these out from the last few days on Commons (all user names) -
Sungate - design advert
Totalpoolwarehouse - obvious & en wp too
Theamazingsystem - two spamvert pages "The Automated Blogging System is a Powerful SEO Technology"
Adventure show - pdf spam file
Firmefront - fr "Banque, Assurance, Gestion Alternative et Private Equity"
The Car Spy - internet car sales
DownIndustries - clothing sales
Serenityweb1 - Nicaragua tourism & en wp
Macminicover - "Dust Cover or Designer Cover for Apple Mac"
I can't instantly find the insurance sales one & I am sure another user produced a page the same as Theamazingsystem. We could do with working out the best way of presenting the info - whether the standard template is needed or whether just an SUL link would allow us a quick check on cross wiki activity?
It would be good to know if the COI bot excludes User: space and whether that may need rethinking?
Cheers --Herby talk thyme 14:35, 10 August 2008 (UTC)[reply]
So far as I know, it watches only the mainspace. But Beetstra above said this could be changed.  — Mike.lifeguard | @en.wb 14:40, 10 August 2008 (UTC)[reply]
Not sure what else is in the works but I think an SUL link to check activity cross-projects would be sufficient. Anything else would be above and beyond but would also be nice. Cirt 15:06, 10 August 2008 (UTC)[reply]
The standard {{ipsummary}} template is pretty good but (I think) lacks the SUL link, which for this kind of stuff would be useful (luxo would be a help though, I guess).
The other thing I guess would be to get agreement to lock the blatantly commercial accounts, just so that they do not do a "JackPotte" on us. I'll maybe point a couple of people to this section. --Herby talk thyme 16:04, 10 August 2008 (UTC)
As it happens I was just trying to lock an account that wasn't SUL yet. I think the concept is sound; these accounts probably should be locked and hidden. Not sure about the mechanics of implementation. ++Lar: t/c 18:39, 10 August 2008 (UTC)
IPs can't have a unified account, so the SUL tool is useless. We have luxo's for that.  — Mike.lifeguard | @en.wb 16:49, 10 August 2008 (UTC)
Yeah - this type really needs an SUL link I think. And we do need to look at the best way to lock overtly commercial accounts. --Herby talk thyme 16:51, 10 August 2008 (UTC)
Today I also saw some spamming by 3 accounts on Commons:Talk:Main Page. I have to say that I agree with Herby here - really a nice idea for stopping at least some of the spamming. --Kanonkas 18:29, 10 August 2008 (UTC)
Good idea, Herby! If you want, I can set up a tool similar to SUL:, i.e. one that lists user pages and blocks, for IPs. Of course, other tools are possible as well. --Erwin(85) 19:33, 10 August 2008 (UTC)
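As a rough sketch of what such a tool could check, the standard MediaWiki API can report per project whether a user page exists and whether the account is currently blocked. The project list and helper below are illustrative assumptions, not Erwin's tool (for IPs, list=blocks takes bkip instead of bkusers, so range blocks are found too):

  # Rough sketch of a cross-wiki user-page/block check (illustrative only).
  import json
  import urllib.parse
  import urllib.request

  PROJECTS = ["commons.wikimedia.org", "en.wikipedia.org", "meta.wikimedia.org"]

  def api(host, **params):
      """One GET request against a wiki's api.php, returning parsed JSON."""
      params["format"] = "json"
      url = "https://%s/w/api.php?%s" % (host, urllib.parse.urlencode(params))
      with urllib.request.urlopen(url) as f:
          return json.load(f)

  def check(name):
      """Print, per project, user-page existence and current blocks."""
      for host in PROJECTS:
          pages = api(host, action="query", prop="info",
                      titles="User:" + name)["query"]["pages"]
          has_page = all("missing" not in p for p in pages.values())
          # For IP addresses, use bkip=name here so range blocks show up too.
          blocks = api(host, action="query", list="blocks",
                       bkusers=name)["query"]["blocks"]
          print("%-25s page: %-5s blocked: %s" % (host, has_page, bool(blocks)))

  check("Example")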

Today :) user:Restaurant-lumiere - restaurant spam - [11]. User page advert, series of images all with plenty of information about the restaurant in the "description". --Herby talk thyme 07:07, 11 August 2008 (UTC)

Well .. enough is enough then. The linkwatchers are from now on also parsing the user namespace. --Dirk Beetstra T C (en: U, T) 10:23, 11 August 2008 (UTC)
The bot is adapted for the new task. I had to tweak en:User:XLinkBot for that, but well, should I also add the 'Wikipedia:' namespace? --Dirk Beetstra T C (en: U, T) 10:36, 11 August 2008 (UTC)
Personally I think not, but others may differ?
+ User talk:Americarx - online pharmacy ads [12], Commons (images & page) & en wp page (& the en wp one had been there a long time). Caught by Kanonkas, so thanks. --Herby talk thyme 10:48, 11 August 2008 (UTC)
Everything that the linkwatchers parse is now getting into the database, and may trigger the XWiki functionality mechanism. We may get more work from this... some more manpower is still necessary (as there are things that I can autocatch which have been excluded so far). --Dirk Beetstra T C (en: U, T) 10:53, 11 August 2008 (UTC)
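For illustration, namespace filtering of this kind can be done with the standard recentchanges API; MediaWiki numbers namespaces 0 = main, 2 = User, 3 = User talk, 4 = Project (i.e. 'Wikipedia:' on en wp). This is a sketch, not the linkwatchers' actual code:

  # Illustrative namespace filter for a linkwatcher-style bot (not the
  # actual linkwatcher code). 0 = main, 2 = User, 3 = User talk,
  # 4 = Project ("Wikipedia:" on en wp).
  import json
  import urllib.parse
  import urllib.request

  WATCHED = "0|2|3"    # append "|4" to also parse the Wikipedia: namespace

  def recent_changes(host):
      """Fetch recent changes restricted to the watched namespaces."""
      params = urllib.parse.urlencode({
          "action": "query", "list": "recentchanges",
          "rcnamespace": WATCHED, "rcprop": "title|ids|user",
          "rclimit": "50", "format": "json",
      })
      url = "https://%s/w/api.php?%s" % (host, params)
      with urllib.request.urlopen(url) as f:
          return json.load(f)["query"]["recentchanges"]

  for rc in recent_changes("commons.wikimedia.org"):
      print(rc["ns"], rc["title"], rc["user"])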
So are we going to make a transcluded subpage etc.? Otherwise this will get difficult :)
+ User:Tbraustralia - spam page - "TBR Australia is the parent company to TBR Calculators and Australian Student" - [13]. --Herby talk thyme 11:09, 11 August 2008 (UTC)
+ user:Housingyou - www.housingyoumakelaars.nl page & image [14]. --Herby talk thyme 17:57, 11 August 2008 (UTC)

I think others already asked about this, but shouldn't this type of listing of problem cross-project spammers/userpages be moved to a subpage? Cirt 23:06, 12 August 2008 (UTC)

For logging purposes, I put it on this page. I think that will work fine.  — Mike.lifeguard | @en.wb 00:46, 13 August 2008 (UTC)
For me it would just be easier to find and check users with the SUL tool if it were in some unified location on a subpage, but either way is probably okay. Cirt 02:21, 13 August 2008 (UTC)
Re-thought this... The original reason I had put things back on this page was for logging purposes (i.e. you can use the snippet at the top as normal). However, I think we can probably do with a separate page, and when things need to be blacklisted, we can do one of two things:
  1. Start a new section in proposed additions and link to the relevant oldid of the User: namespace abuse page; then log that when blacklisting (more complicated, but more transparent, which is an especially good thing for this set of cases, I think); or
  2. Add a logging snippet to the separate page (perhaps easier than the above, though less transparent).
I welcome comments either here or there.  — Mike.lifeguard | @en.wb 19:35, 20 August 2008 (UTC)

Our spam filter is now blocking spam URLs in edit summaries

FYI: our spam filter now appears to block spam addresses in edit summaries even if the domain is not in the page text. I just learned this the hard way. It's probably a response to all the shock-site spam recently left in edit summaries by vandals; some of those sites will crash browsers. --A. B. (talk) 07:45, 20 August 2008 (UTC)

it's not a very new feature: see bugzilla:13599. -- seth 07:52, 20 August 2008 (UTC)
Yes, this has been mentioned before, and is quite a nice feature.
On a not-very-related subject: do we think it would be a good idea to make rollback exempt from the spam blacklist? Removing spam which has been blacklisted should be a task separate from vandal fighting, and vandal fighting shouldn't be hindered by necessitating the removal of blacklisted domains. I'm not sure how difficult this would be from a technical standpoint, but it may be worth requesting. Input definitely requested.  — Mike.lifeguard | @en.wb 19:14, 20 August 2008 (UTC)
ehm, i guess i didn't get your point. do you mean hypothetical vandals who delete blacklisted links? -- seth 23:55, 20 August 2008 (UTC)
I mean if a page has a blacklisted link and a vandal blanks it or otherwise vandalizes it, we cannot simply revert (which would be "adding" a blacklisted link) - we must instead edit the page to remove the link. This slows down vandal fighting.  — Mike.lifeguard | @en.wb 00:25, 21 August 2008 (UTC)
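A toy illustration of that point: the filter only checks links added by an edit, so blanking a page passes, while the revert that restores the page "adds" the blacklisted link back and is blocked. The pattern and helpers are made up; the real extension works on parsed external links rather than raw text:

  # Toy model of the blacklist check (made-up pattern; the real extension
  # works on parsed external links, not raw text).
  import re

  BLACKLIST = [re.compile(r"badsite\.example\.com", re.IGNORECASE)]

  def added_links(old_text, new_text):
      """Links present in the new revision but not in the old one."""
      old = set(re.findall(r"https?://\S+", old_text))
      new = set(re.findall(r"https?://\S+", new_text))
      return new - old

  def edit_allowed(old_text, new_text):
      """Only newly added links are checked against the blacklist."""
      return not any(rx.search(link)
                     for link in added_links(old_text, new_text)
                     for rx in BLACKLIST)

  page = "See http://badsite.example.com/page for details."
  print(edit_allowed(page, ""))   # True:  blanking removes the link
  print(edit_allowed("", page))   # False: the revert re-adds it, so it's blocked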
i see. have there been any vandalisms like that already? or would you just like to protect wikipedia pre-emptively? how should we practically (not technically) try to solve this? i guess that such vandalism can't be avoided. the block-problem could perhaps be reduced, but probably not fully avoided. -- seth 03:18, 21 August 2008 (UTC)

Mike.lifeguard (talk · contribs)'s idea makes a lot of sense and I support it. We should make rollback exempt from the spam blacklist. Cirt (talk) 02:19, 3 September 2008 (UTC)

I guess, if a rollback worked around the sbl, the sbl could be worked around too easily. :-) -- seth 07:50, 3 September 2008 (UTC)
I don't see how. You can use rollback only to roll back - you couldn't use it to spam. Note this has been requested as bugzilla:15450.  — Mike.lifeguard | @en.wb 21:17, 3 September 2008 (UTC)
If user A vandalizes and removes blacklisted (but good) links, and user B tries to undo that, then you are right. But:
If user A spams a url, and user B adds this url to the sbl and undoes user A's edits, then user A (or any other user!) may use the rollback function to respawn the blacklisted links (if rollbacks were not checked against the sbl). -- seth 23:01, 3 September 2008 (UTC)
True, but then the spammer would have to get access to rollback (and I would very quickly indef block that user if he managed to do this). But I think it is better to remove blacklisted links from revisions anyway, as they do cause 'problems' for other users; having the link on the blacklist before we clean up is merely handy to avoid disruption - it does not mean we don't have to clean anymore. --Dirk Beetstra T C (en: U, T) 09:26, 4 September 2008 (UTC)
There are more reasons for the fix of bug 1505; see e.g. bugzilla:14091.
Another example: in dewiki we had some spamming of a url, say www.example.org/something. Actually this url is good, but shall only be linked in article foo. However, some users posted that link in various other articles, so the link was put on the sbl and deleted from all pages except talk pages and article foo. Because of the fix of bug 1505, the problem could be solved in that way. -- seth 10:23, 4 September 2008 (UTC)
Yes, that is true, but if 'foo' gets vandalised and the link gets broken, you can't repair it without removing the link, or whitelisting the specific link. So it still needs removing, even if it is supposed to be there. It would be nice if we could still use rollback to revert then (overriding the blacklist notice), but it does not actually solve the problem. The good thing is that we can now first blacklist a link and then remove it; we no longer have to try to keep up with a spammer in removing links before we blacklist. That is the only gain of this feature. --Dirk Beetstra T C (en: U, T) 14:37, 4 September 2008 (UTC)
If a page gets vandalized and a link gets broken (i guess in dewiki we have not had such a case yet), one can temporarily comment out the sbl entry. So we are back at the beginning... All possibilities have their advantages and disadvantages. In dewiki the vandalized blacklisted links are no problem; are they a serious problem in enwiki (or somewhere else)? -- seth 14:57, 4 September 2008 (UTC)
That I don't know. I expect that relatively new editors will simply follow the suggestion and remove the link, not knowing where to complain. But I do think that the proper solution is to whitelist such links anyway; then the problem simply never occurs. --Dirk Beetstra T C (en: U, T) 15:00, 4 September 2008 (UTC)
No, blacklisted links should be removed in general. In cases where there is legitimate use, whitelisting should be used. However, that is a separate issue from reverting vandalism: removing bad links is a separate task, and we shouldn't slow vandal fighters doing one job by forcing them to do another, unrelated job as well.  — Mike.lifeguard | talk 15:36, 4 September 2008 (UTC)
The thing is, if the link is blacklisted but wanted, and a vandal damages the link, you need an administrator to repair the page completely. When the link is on the whitelist, that can also be done by non-admins, who do a lot of antivandalism and antispam work as well (and for us cross-wiki spam fighters: I am an admin on en and meta, but if this happens on nl, de, wherever (I am thinking of link hijacking, e.g.), I need to find someone on those wikis to repair it). (Whee, I have just found another reason why global admins should be GLOBAL.) --Dirk Beetstra T C (en: U, T) 15:41, 4 September 2008 (UTC)
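To make the whitelist point concrete, a sketch of the interplay described above: a link is blocked only if it matches the blacklist and no whitelist entry, so whitelisting the one good URL lets any editor, admin or not, restore it after vandalism. Patterns are illustrative; real entries live on each wiki's MediaWiki:Spam-blacklist and MediaWiki:Spam-whitelist pages:

  # Sketch of the blacklist/whitelist interplay (illustrative patterns).
  import re

  BLACKLIST = [re.compile(r"example\.org", re.IGNORECASE)]
  WHITELIST = [re.compile(r"example\.org/something", re.IGNORECASE)]

  def is_blocked(link):
      """Whitelist entries carve exceptions out of the blacklist."""
      if any(rx.search(link) for rx in WHITELIST):
          return False    # whitelisted: non-admins can repair the page too
      return any(rx.search(link) for rx in BLACKLIST)

  print(is_blocked("http://www.example.org/spam"))       # True
  print(is_blocked("http://www.example.org/something"))  # False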

Single wiki spam - opinions?

Over the past couple of days I have dealt with two spammers on Commons (related) & one on en wp. The links they are placing are of no value to the project whatsoever. In the case of the Commons ones the links can be seen here (+deleted ones) & here (sorry - deleted only). For the en wp one here (again - deleted only, all Chinese websites).

I do know that this list is solely for blacklisting cross-wiki spamming; however, these seem to be of no value & I would be surprised if they did not try again (I have not had time to check cross-wiki, to be fair). I'd be inclined to list them, but other opinions would be great. Cheers --Herby talk thyme 07:19, 6 September 2008 (UTC)

Well, it depends. I'm not against using the global blacklist if the link is clearly of no use, has been spammed a lot on a single project and could be spammed on other projects. Cases in which the URL is limited to one project and not expected to be spammed on other projects should be dealt with locally, though. If I look at the Dutch Wikipedia's blacklist, we have some Dutch URLs which probably won't be used on other projects, so there's no need to blacklist them here. --Erwin(85) 08:31, 6 September 2008 (UTC)
Agreed - when there is some strong indication that the spam will be cross-wiki, it is appropriate to consider global blacklisting. I would be cautious in doing so, but yes, it is worth discussing on a case-by-case basis.  — Mike.lifeguard | talk 14:25, 6 September 2008 (UTC)
I think there are two cases in which single-wiki spam should always be blacklisted:
  • URL redirects
  • Sites with malware, viruses, etc.
There are some others that are judgement calls. Thousands of non-Wikimedia wikis running MediaWiki elect to use the meta blacklist in their own filtering. Our list is publicly available and we have neither control over who else chooses to use it nor responsibility for their choice. Just the same, if a site is spammed to nl.wikipedia, for instance, but is total junk and would be unwelcome on every non-Wikimedia wiki, then it probably makes sense to blacklist it here; it's probably getting spammed to other Dutch wikis outside the Wikimedia world. --A. B. (talk) 02:27, 7 September 2008 (UTC)