
Talk:Spam blacklist

This is an archived version of this page, as edited by Lustiger seth (talk | contribs) at 07:52, 20 August 2008 (→‎Our spam filter is now blocking spam URLs in edit summaries: +re). It may differ significantly from the current version.

Shortcut:
WM:SPAM
The associated page is used by the Mediawiki Spam Blacklist extension, and lists strings of text that may not be used in URLs in any page in Wikimedia Foundation projects (as well as many external wikis). Any meta administrator can edit the spam blacklist. There is also a more aggressive way to block spamming through direct use of $wgSpamRegex. Only developers can make changes to $wgSpamRegex, and its use is to be avoided whenever possible.
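For illustration, $wgSpamRegex lives in LocalSettings.php and, unlike the blacklist, is matched against the whole text of an edit rather than only against URLs; a minimal, purely hypothetical example:
// in LocalSettings.php (hypothetical pattern; only developers change this setting in production)
// unlike the blacklist, $wgSpamRegex is checked against the whole edited text, not just URLs
$wgSpamRegex = '/online-casino-payouts|cheap-v1agra/i';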

For more information on what the spam blacklist is for, and the processes used here, please see Spam blacklist/About.

Please post comments in the appropriate section below: Proposed additions, Proposed removals, or Troubleshooting and problems; read the message boxes at the top of each section for an explanation. Please also check back some time after submitting, as there may be questions about your request. Per-project whitelists are discussed at MediaWiki talk:Spam-whitelist. Please sign your posts with ~~~~ after your comment. For other discussions related to the blacklist that do not concern a particular link, see Spam blacklist policy discussion.

Completed requests are archived (list, search); additions and removals are logged.

snippet for logging: {{/request|1141934#{{subst:anchorencode:SectionNameHere}}}}

If you cannot find your remark below, please do a search for the URL in question with this Archive Search tool.

Spam that only affects a single project should go to that project's local blacklist

Proposed additions

This section is for proposing that a website be blacklisted; add new entries at the bottom of the section, using the basic URL so that there is no link (example.com, not http://www.example.com). Provide links demonstrating widespread spamming by multiple users on multiple wikis. Completed requests will be marked as {{added}} or {{declined}} and archived.




nijmegennieuws.nl and doetinchemnieuws.nl

Added



and



Spammed by



The bots are down, so this request is for logging only. --Erwin(85) 12:22, 10 August 2008 (UTC)Reply

What do you mean "for logging only" & what does that have to do with the bots being down? I must be missing something here.  — Mike.lifeguard | @en.wb 01:00, 13 August 2008 (UTC)Reply
The LinkWatchers reported three edits for doetinchemnieuws.nl and then stopped reporting anything on IRC. Some time later I checked this IP's edits using Luxo's tool and noticed he kept on spamming. I added the request here so that Spam blacklist/Log can refer to this request, specifically the Luxo link, as I couldn't refer to XWiki reports. The one about doetinchemnieuws.nl showed three edits and there weren't any for nijmegennieuws.nl. Does this explain it? --Erwin(85) 08:04, 13 August 2008 (UTC)Reply
OK, yeah. Perhaps I'm a bit off today.  — Mike.lifeguard | @en.wb 02:49, 14 August 2008 (UTC)Reply

porno-izlee.com





Just 2 links so far, but I believe he is just beginning to add them; I put him on our bl. Best regards, --birdy geimfyglið (:> )=| 23:06, 13 August 2008 (UTC)Reply

This can stay, I think. Added Logged  — Mike.lifeguard | @en.wb 19:45, 15 August 2008 (UTC)Reply

People may want to look into other things on the same server:

  • Top 10 domains on server porno-izlee.com (67.159.45.5): deniztube.com (153), mynewhaircut.net (52), ghanaclips.com (13), turkishi.com (7), youtubecity.net (6), DenizTube.com (5), porno-izlee.com (4), redindir.com (4), faveladodarocinha.com (3), 911researchers.com (2)

  • Top 10 editors who have added deniztube.com: 85.99.214.79 (108), 88.228.40.137 (10), 78.169.38.36 (8), 88.228.18.253 (7), LovelessGent (5), 78.169.46.60 (4), 85.99.215.176 (3), 78.169.48.172 (2), 88.228.36.165 (2), 88.228.37.67 (2).

The top one has 108 link additions, across a range of wikis.

Seems to lead even further: ghanaclips.com has a different set of IP users (in a 41.210 range and some others), but that seems en-only (though where did I see kokoliko.com recently?).

....

I need help to prune this out completely. --Dirk Beetstra T C (en: U, T) 17:16, 18 August 2008 (UTC)Reply

qpc.ro

Spam domain

Note that the specific pages linked to are probable copyright violations (they host copies of movies and TV shows):





Spam account




--A. B. (talk) 18:45, 14 August 2008 (UTC)Reply

internationalbadminton.org

i'm not sure about this one. have a look at this diagnostic page: [1]. have there been comparable cases in the past? -- seth 21:27, 17 August 2008 (UTC)Reply

Thanks seth - to me malware sites are always listable to protect the many wikis that depend on this list. Added, cheers --Herby talk thyme 14:14, 18 August 2008 (UTC)Reply
i guess, the site is/was hacked temporarily. it is linked many times in :de and :en, probably because its content is useful. and i don't know how long google leaves hacked sites in its abuse-list. -- seth 16:45, 18 August 2008 (UTC)Reply
I don't see any abuse (according to my database). I suggest that if the problem is gone, it is removed, as blacklisting here does disrupt the pages on-wiki (if someone vandalises a page and removes the link, the edit cannot be reverted). (What we would need for these is a regex list of external links that are 'disabled', not 'blacklisted'.) --Dirk Beetstra T C (en: U, T) 16:48, 18 August 2008 (UTC)Reply
I've rem'd it out for now. We do (& should) BL sites that contain exploits, but if the exploit is not current that is another matter. Maybe we can get more on the Google exploit one? The warning looked legit to me. Cheers --Herby talk thyme 16:52, 18 August 2008 (UTC)Reply

tarkanfunclub.com



Spammers

Fansite. Spammer's contribs speak for themselves. See also w:WT:WPSPAM#Fanclub spammer (permanent link). MER-C 13:39, 18 August 2008 (UTC)Reply

Yes - more than a nuisance, Added. Thanks --Herby talk thyme 17:46, 18 August 2008 (UTC)Reply


More domains spammed




  • Google Adsense: 5551319961929303


  • Google Adsense: 2743631921357480

  • Google Adsense: 5551319961929303


Related domain


--A. B. (talk) 00:03, 19 August 2008 (UTC)Reply


Second batch Added --A. B. (talk) 00:35, 19 August 2008 (UTC)Reply

unitursa.com spam

Spam domains







Related domains

Spam account



Reference

--A. B. (talk) 00:12, 19 August 2008 (UTC)Reply


Added --A. B. (talk) 00:36, 19 August 2008 (UTC)Reply

firme.rs spam

Spam domains




Google Adsense ID: 1349757567489797


Related domain



Spam accounts





Reference

--A. B. (talk) 00:16, 19 August 2008 (UTC)Reply


Added --A. B. (talk) 00:37, 19 August 2008 (UTC)Reply

onlineseo.info

Domains




Google Adsense ID: 3239128903599293


Related domain



Accounts

Reference

--A. B. (talk) 00:17, 19 August 2008 (UTC)Reply


Added --A. B. (talk) 00:37, 19 August 2008 (UTC)Reply


mysmp.com

Domain


Google Adsense ID: 0719114306637522


Accounts







Reference

--A. B. (talk) 03:24, 20 August 2008 (UTC)Reply


Rich Media Project

Domain



Accounts

References

--A. B. (talk) 03:34, 20 August 2008 (UTC)Reply


web-anatomy.com

Spam accounts







Spam domain




References

--A. B. (talk) 03:52, 20 August 2008 (UTC)Reply

Proposed additions (Bot reported)

This section is for websites which have been added to multiple wikis as observed by a bot.

Items there will automatically be archived by the bot when they get stale.

Sysops, please change the LinkStatus template to closed ({{LinkStatus|closed}}) when the report is dealt with, and change to ignore for good links ({{LinkStatus|ignore}}). More information can be found at User:SpamReportBot/cw/about

These are automated reports; please check the records and the links thoroughly, as they may be good links! For some more info, see Spam blacklist/help#SpamReportBot_reports

If the report contains links to fewer than 5 wikis, then only add it when it is really spam. Otherwise just revert the link additions and close the report; closed reports will be reopened if spamming continues.

The bot will automagically mark as stale any reports that have fewer than 5 links reported, have not been edited in the last 7 days, and whose last editor is COIBot. They can be found in this category.

Please place suggestions on the automated reports in the discussion section.

Running; it will report a domain shortly after a link is used more than 2 times by one user on more than 2 wikipedias (technically: when more than 66% of the link's additions were made by this user, and more than 66% were added cross-wiki). Same system as SpamReportBot (discussions go after the remark "<!-- Please put comments after this remark -->" at the bottom; please close reports when reverted/blacklisted/waiting for more, or mark them ignore when it is a good link).
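As a rough sketch of that threshold (not COIBot's actual code; the names are made up and the counts are assumed to come from the linkwatcher database):
// sketch of the reporting threshold described above; function and variable names are illustrative
// $additionsByUser - times this user added the link
// $totalAdditions  - times the link was added by anyone
// $xwikiAdditions  - how many of those additions were cross-wiki
// $userWikiCount   - number of wikis on which this user added the link
function shouldReport( $additionsByUser, $totalAdditions, $xwikiAdditions, $userWikiCount ) {
  if ( $additionsByUser <= 2 || $userWikiCount <= 2 ) {
    return false; // "more than 2 times by one user on more than 2 wikis"
  }
  // "more than 66% added by this user, and more than 66% added cross-wiki"
  return ( $additionsByUser / $totalAdditions ) > 0.66
    && ( $xwikiAdditions / $totalAdditions ) > 0.66;
}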

List Last update By Site IP R Last user Last link addition User Link User - Link User - Link - Wikis Link - Wikis
vrsystems.ru 2023-06-27 15:51:16 COIBot 195.24.68.17 (192.36.57.94, 193.46.56.178, 194.71.126.227, 93.99.104.93) 2070-01-01 05:00:00 4 4

Proposed removals

This section is for proposing that a website be unlisted; please add new entries at the bottom of the section.

Remember to provide the specific domain blacklisted, links to the articles they are used in or useful to, and arguments in favour of unlisting. Completed requests will be marked as {{removed}} or {{declined}} and archived.

See also /recurring requests for repeatedly proposed (and refused) removals.

The addition or removal of a domain from the blacklist is not a vote; please do not bold the first words in statements.





youporn.com

The YouPorn Wikipedia article should have a link to youporn.com but this is blocked by the spam filter. --Helohe 14:04, 11 August 2008 (UTC)Reply

You're right, please request whitelisting for a main-page specific url on the appropriate whitelist page (en:MediaWiki talk:Spam-whitelist). As such  Declined here. Thanks. --Dirk Beetstra T C (en: U, T) 14:05, 11 August 2008 (UTC)Reply


Lluisllach.pl

lluisllach.pl is a fine site, referring to a Geocities page. No spam, no porn. There are many pages about Lluis Llach, and the link was accepted by the Polish one. Blocking really does not seem necessary. The preceding unsigned comment was added by 212.39.28.26 (talk • contribs) 12:17, 16 Aug 2008 (UTC)

The site (as you have spelt it) does not appear to be blacklisted here. Thanks --Herby talk thyme 12:23, 16 August 2008 (UTC)Reply
the sbl is case-insensitive, the entry is
\blluisllach\.pl\b
for a given url you can use [2] (beta state) to find the corresponding entries. -- seth 13:53, 16 August 2008 (UTC)Reply
Thanks seth - that way it is here because of this report. It was reverted, links placed again, so listed. Looks valid to me. For anyone who doesn't look at it: the appeal is by the IP that was responsible for the link placement. Cheers --Herby talk thyme 13:55, 16 August 2008 (UTC)Reply
 Declined per Herby and original report.  — Mike.lifeguard | @en.wb 23:16, 16 August 2008 (UTC)Reply

Troubleshooting and problems

This section is for comments related to problems with the blacklist (such as incorrect syntax or entries not being blocked), or problems saving a page because of a blacklisted link. This is not the section to request that an entry be unlisted (see Proposed removals above).

double/wrong entries

when i deleted some entries from the german sbl, which are already listed in the meta sbl, i saw that there are many double entries in the meta sbl, e.g., search for

top-seo, buy-viagra, powerleveling, cthb, timeyiqi, cnvacation, mendean

and you'll find some of them. if you find it useful, i can try to write a small script (in august), which indicates more entries of this kind.
furthermore i'm wondering about some entries:

  1. "\zoofilia", for "\z" matches the end of a string.
  2. "\.us\.ma([\/\]\b\s]|$)", for ([\/\]\b\s]|$) is the same as simply \b, isn't it? (back-refs are not of interest here)
  3. "1001nights\.net\free-porn", for \f matches a formfeed, i.e., never
  4. "\bweb\.archive\.org\[^ \]\{0,50\}", for that seems to be BRE, but php uses PCRE, so i guess, this will never match
  5. "\btranslatedarticles\].com", for \] matches a ']', so will probably never match.

before i go on, i want to know, if you are interested in this information or not. :-) -- seth 22:23, 12 July 2008 (UTC)Reply
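A quick way to double-check entries like these is to run them through preg_match (which is what the extension uses); a small sketch with made-up sample URLs:
// sketch: check whether the suspicious patterns above can ever match a plausible URL
$suspects = array(
  '/\zoofilia/'                  => 'http://www.zoofilia-example.com/',
  '/1001nights\.net\free-porn/'  => 'http://1001nights.net/free-porn',
  '/\btranslatedarticles\].com/' => 'http://translatedarticles.com/',
);
foreach ( $suspects as $regex => $url ) {
  echo $regex . ' => ' . ( preg_match( $regex, $url ) ? 'matches' : 'never matches' ) . "\n";
}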

You know, we could use someone like you to clean up the blacklist... :D Kylu 01:53, 13 July 2008 (UTC)Reply
We are indeed interested in such issues - I will hopefully fix these ones now; keep 'em coming!  — Mike.lifeguard | @en.wb 01:59, 13 July 2008 (UTC)Reply
Some of the dupes will be left for clarity's sake. When regexes are part of the same request they can be safely consolidated (I do this whenever I find them), but when they are not, it would be confusing to do so, in many cases. Perhaps merging regexes in a way that is sure to be clear in the future is something worth discussing, but I can think of no good way of doing so.  — Mike.lifeguard | @en.wb 02:06, 13 July 2008 (UTC)Reply
in de-SBL we try to cope with that only in our log-file [3]. there one can find all necessary information about every white-, de-white-, black- and de-blacklisting. the sbl itself is just a regexp-speed-optimized list for the extension without any claim of being chronologically arranged.
i guess, that the size of the blacklist will keep increasing in future, so a speed optimization will perhaps become necessary. btw, has anyone ever made any benchmarks of this extension? i merely know that some buffering was implemented once.
oh, and if one wants to correct further regexps: just search by regexps (e.g. by vim) for /\\[^.b\/+?]/ manually and delete needless backslashes, e.g. \- \~ \= \:. apart from that the brackets in single-char-classes like [\w] are needless too. "\s" will never match. -- seth 11:36, 13 July 2008 (UTC)Reply
fine-tuning: [1234] is much faster in processing than (1|2|3|4); and (?:foo|bar|baz) is faster than (foo|bar|baz). -- seth 18:21, 13 July 2008 (UTC)Reply
I benchmarked it, (a|b|c) and [abc] had different performance. Same with the latter case — VasilievV 2 21:02, 14 July 2008 (UTC)Reply
So should we be making those changes? (ie was it of net benefit to performance?)  — Mike.lifeguard | @en.wb 21:56, 15 July 2008 (UTC)Reply
these differences result from the regexp-implementation. but what i meant with benchmarking is the following: how much does the length of the blacklist cost (measured in time)? i don't know how fast the wp-servers are. however, i benchmarked it now on my present but old computer (about 300-500MHz):
if i have one simple url like http://www.example.org/ and let the ~6400 entries of the present meta-blacklist match against this url, it takes about 0.15 seconds until all regexps are done. and i really measured only the pure matching:
// reduced part of SpamBlacklist_body.php
// $blacklists holds the regexes built from the blacklist pages,
// $links the external links of the edit being checked
foreach ( $blacklists as $regex ) {
  $check = preg_match( $regex, $links, $matches );
  if ( $check ) {
    // hit: a blacklisted link was found, no need to test the remaining regexes
    $retVal = 1;
    break;
  }
}
so i suppose, that it would not be a bad idea to care about speed, i.e. replace unnecessary patterns by faster patterns and remove double entries. ;-)
if you want me to, i can help with that, but soonest in august.
well, the replacement is done quickly, if one of you uses vim
the replacement of (.|...) by [...] can be done manually, because there are just 6 occurrences. the replacement of (...) by (?:...) can be done afterwards by
:%s/^\([^#]*\)\(\\\)\@<!(\(?\)\@!/\1(?:/gc
-- seth 23:26, 15 July 2008 (UTC)Reply
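For anyone who wants to reproduce a measurement like the 0.15 seconds mentioned above, a minimal timing harness might look like this (a sketch; $blacklists is assumed to already hold the compiled entries of the current list):
// sketch: time how long the full blacklist takes against a single harmless URL
$links = 'http://www.example.org/';
$start = microtime( true );
$retVal = 0;
foreach ( $blacklists as $regex ) {
  if ( preg_match( $regex, $links ) ) {
    $retVal = 1;
    break;
  }
}
echo 'matched: ' . $retVal . ', seconds: ' . ( microtime( true ) - $start ) . "\n";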
some explicit further bugs:
\mysergeybrin\.com -> \m does not exist
\hd-dvd-key\.com -> \h does not exist
however, because nobody answered (or read?) my last comment... would it be useful to give me temporarily the rights to do the modifications by myself? -- seth 01:44, 7 August 2008 (UTC)Reply
I fixed these. You can always request (temporary) sysop status. Any help is appreciated. --Erwin(85) 12:45, 7 August 2008 (UTC)Reply
requested and got it. :-) -- seth 09:18, 13 August 2008 (UTC)Reply

before i start modifying the list, i want to know whether i should log my changes somewhere. oh, and btw. i suppose that the entry [0-9]+\.[-\w\d]+\.info\/?[-\w\d]+[0-9]+[-\w\d]*\] is somewhat senseless, for it will probably never match. i found the original discussion [4] (the regexp was changed afterwards), but the regexp will not grep the links mentioned there. shall i just delete such an entry or shall i make a new request and try to correct it? -- seth 09:18, 13 August 2008 (UTC)Reply

It would be nice if you could update the log as well, so we can still find the corresponding log message. Though maybe we should wait and see if anything new comes out of #The Logs. I guess it's best to correct wrong entries or in any case log all those removals. It probably wouldn't hurt if some were removed, but I have no idea how many entries we're talking about. --Erwin(85) 09:31, 13 August 2008 (UTC)Reply
ok, so i'll wait until the other thread is finished. but i don't think that manipulating the logs is a good idea, because that will make tracing entry changes difficult.
i guess, there are less than 10, perhaps even less than 5 useless entries. -- seth 10:29, 13 August 2008 (UTC)Reply

double entries

i wrote a small script to grep most of the double (or multi) entries. the result is presented on User:Lustiger_seth/sbl_double_entries. as you can see, there are many (>250) redundant entries. i guess, we could delete more than 200 entries. -- seth 22:59, 19 August 2008 (UTC)Reply

User: namespace abuse

User:Restaurant-lumiere



per Herby.  — Mike.lifeguard | @en.wb 22:13, 11 August 2008 (UTC)Reply

Americarx



 — Mike.lifeguard | @en.wb 22:13, 11 August 2008 (UTC)Reply

Blocked on 3 wikis already - one to watch, I think.  — Mike.lifeguard | @en.wb 22:52, 11 August 2008 (UTC)Reply

Tbraustralia



 — Mike.lifeguard | @en.wb 22:14, 11 August 2008 (UTC)Reply

Housingyou



 — Mike.lifeguard | @en.wb 22:14, 11 August 2008 (UTC)Reply

Thebalfourgroup



Commons & enwiki so far - not blocked yet.  — Mike.lifeguard | @en.wb 22:45, 11 August 2008 (UTC)Reply
Deleted & warned on Commons.  — Mike.lifeguard | @en.wb 00:14, 12 August 2008 (UTC)Reply

hkcbn.org



Not sure if it is crosswiki, leaving it here but going to bed now, best regards, --birdy geimfyglið (:> )=| 02:54, 12 August 2008 (UTC)Reply
It was (& some cross wiki blocks need placing). Added by User:Kylu, thanks --Herby talk thyme 06:52, 12 August 2008 (UTC)Reply
Hit en.wb (thanks kylu): deleted & blocked. Some organization on cross-wiki blocks is needed.  — Mike.lifeguard | @en.wb 12:44, 12 August 2008 (UTC)Reply
And wouldn't global sysop be good for just this stuff........:( --Herby talk thyme 12:50, 12 August 2008 (UTC)Reply
Hm, Kylu marked them for deletion, all those wikis have local active crats, they have to clean themselves, best regards, --birdy geimfyglið (:> )=| 12:52, 12 August 2008 (UTC)Reply


Spam page on Commons. --Herby talk thyme 07:09, 12 August 2008 (UTC)Reply

Jerald Franklin Archer



Violin lessons page on Commons & en wp. --Herby talk thyme 07:09, 12 August 2008 (UTC)Reply

fabianswebworld.fa.funpic.de fschneider.de.vu



Next one :( --birdy geimfyglið (:> )=| 10:57, 12 August 2008 (UTC)Reply

Looks quite cross-wiki to me, I Added it already, --birdy geimfyglið (:> )=| 11:04, 12 August 2008 (UTC)Reply
Thanks birdy - links all cleared. --Herby talk thyme 11:55, 12 August 2008 (UTC)Reply

Nervenhammer



similar pattern, adding a personal link..--Cometstyles 12:01, 12 August 2008 (UTC)Reply
Thanks Comets - Added for now. In passing, I see no harm in listing such sites, as much to send a message to the user that their behaviour may not be appropriate. Not sure how lasting the listing should be or whether we should log immediately - thoughts welcome. --Herby talk thyme 12:12, 12 August 2008 (UTC)Reply
Reviewing this it may well be a good faith de user who has just decided to expand their interests (based on SUL info). In which case I suggest serious consideration for de-listing if we are asked. --Herby talk thyme 12:17, 12 August 2008 (UTC)Reply

MariaTash



Jewellery sales. Page & images on Commons, user page ad on en wp. --Herby talk thyme 18:22, 12 August 2008 (UTC)Reply

Daliahilfi



"Talent Lab" recruitment page on Commons. --Herby talk thyme 18:24, 12 August 2008 (UTC)Reply

Autofinance



Cross wiki spam pages. (autofinance-ez.com is the domain). --Herby talk thyme 12:59, 13 August 2008 (UTC)Reply

Bestlyriccollection



What is that? I stumbled into it when 84.109.83.73 was vandalizing through the wikis. Best regards, --birdy geimfyglið (:> )=| 10:41, 14 August 2008 (UTC)Reply

Very odd indeed. fr wp didn't like the idea of a "user page for bookmarks". Not sure that it is spam but sure doesn't look like "normal" user pages. Looking some more & other opinions would be good. --Herby talk thyme 11:02, 14 August 2008 (UTC)Reply
They have a point [5]... I don't understand why he needs that on multiple wikis; I mean, if he (mis)uses his userpage for bookmarks, why in so many places, --birdy geimfyglið (:> )=| 12:25, 14 August 2008 (UTC)Reply

Discussion

Another xwiki user page abuse?

I came across this one today. It seems to have the makings of a non-contributor who is creating user pages with personal links on them. Any other views? Cheers --Herby talk thyme 07:27, 1 August 2008 (UTC)Reply

Not good. Don't have time to remove links currently, but certainly worth doing (and probably blacklisting too).  — Mike.lifeguard | @en.wb 12:02, 1 August 2008 (UTC)Reply
hmm seems problematic, I have removed all of the links and if he re-adds, we may have to blacklist it ..--Cometstyles 12:34, 1 August 2008 (UTC)Reply
Do I really have to add user space (not the talk) to the spaces the linkwatchers parse? It is just a small step for me .. --Beetstra public 21:06, 1 August 2008 (UTC)Reply
If it is easy to do, by all means! Currently tracking these is very much hit-and-miss. We found JackPotte through the SWMTBots, but that will not always be assured, as they are not designed to watch for this sort of thing.  — Mike.lifeguard | @en.wb 23:43, 1 August 2008 (UTC)Reply
Yes, good idea. A lot of users have spam in userspace. JzG 22:29, 2 August 2008 (UTC)Reply

A bit of delay handling this one...







 — Mike.lifeguard | @en.wb 18:48, 9 August 2008 (UTC)Reply

Added both.  — Mike.lifeguard | @en.wb 18:13, 10 August 2008 (UTC)Reply

NOINDEX

Prior discussion at Talk:Spam_blacklist/Archives/2008/06#Excluding_our_work_from_search_engines, among other places

There is now a magic word __NOINDEX__ which we can use to selectively exclude certain pages from being indexed. I suggest having the bots use this magic word in all generated reports immediately. Whether to have this page and its archives indexed was a point of contention previously, and deserves further discussion.  — Mike.lifeguard | @en.wb 01:33, 4 August 2008 (UTC)Reply

Sorry missed this one. I certainly support the "noindex" of the bot pages. They are somewhat speculative. If we could get the page name changed I would be happier about not using the magic word on this but..... --Herby talk thyme 16:09, 6 August 2008 (UTC)Reply
I have added the keyword to the COIBot generated reports, they should now follow that. --Dirk Beetstra T C (en: U, T) 16:31, 6 August 2008 (UTC)Reply
My bot is flagged now, so I can start adding it to old reports. I will poke a sysadmin first to see if I really must make ~12000 edits before I start though. It will not be all in one go, and I will not start for a day or two.
Any other thoughts on adding it to this page and/or its archives?  — Mike.lifeguard | @en.wb 18:11, 9 August 2008 (UTC)Reply
Already sort-of done with {{linkstatus}}, so the bot probably won't run. I plan to keep the flag though <evil grin>  — Mike.lifeguard | @en.wb 22:56, 11 August 2008 (UTC)Reply

Renaming the blacklist should be done at some point in the future; we'll have to wait on Brion for that. Until then, I'd like to have this page and its archives __NOINDEX__ed. Having it indexed causes more issues than it solves & we now have an easy way to remedy the situation. We should review this when the blacklist is renamed.  — Mike.lifeguard | @en.wb 02:55, 14 August 2008 (UTC)Reply

The Logs

log system

I would like to consolidate our logs into one system which uses subpages and transclusions to make things easy. Each month would get a subpage, which is then transcluded onto Spam blacklist/Log so they can easily be searched. This would mean merging Nakon's "log entries" into the main log, and including the pre-2008 log. This wouldn't require much change in how we log things.

However, I wonder what people think about also logging removals and/or changes to the regexes. Currently, we don't keep track of those in any systematic way, but I think we should. For example, I consolidated a few regexes a while back, and simply made the old log entries match the new regexes, which is rather Orwellian. Similarly, we simply remove log entries when we remove domains - nothing is added to the log, so we cannot track this easily. This idea (changing the way we log things) is likely going to require some discussion; I don't think there should be any problem moving to transcluded subpages immediately.

 — Mike.lifeguard | @en.wb 14:41, 6 August 2008 (UTC)Reply

I'm all for using one system for the logs. I'm not sure about your second idea though. Is the log intended purely to explain the current entries or also former entries and perhaps even edits? Logging removals would be a good idea to see if a domain was once listed, but logging changes seems too bureaucratic. Matching the log entries with the new regexes might be Orwellian, but it's also pragmatic. What are the advantages of logging changes? Could you perhaps give an example of how you suggest to log changes? --Erwin(85) 18:16, 6 August 2008 (UTC)Reply
I should say I mean "Orwellian" without the connotative value. The denotative value is simply that the current method is "changing history" - not in and of itself a bad thing. Indeed, I've had no issues with this, hence the speculative nature of that part of my suggestion.  — Mike.lifeguard | @en.wb 19:48, 6 August 2008 (UTC)Reply
in de:WP:SBL we do log all new entries, removals and changes on black- and whitelists. logging changes can be useful e.g. for retracing old discussions. -- seth 01:35, 7 August 2008 (UTC)Reply
i think, that the transclusions are a good idea to keep the traffic low. is anybody against that?
concerning the logging of removals/modifications: what do you think about a log system like de:Wikipedia:Spam-blacklist/log#Mai_2008? -- seth 12:12, 13 August 2008 (UTC)Reply
It would be quite some work to link the diffs, but I'm not against using it. I guess that means this is a weak support. --Erwin(85) 09:35, 19 August 2008 (UTC)Reply

tool for log searching

The simplest way to improve searchability is to write a tool that searches the logs for you. I'm in the middle of doing so, and I'll have a working prototype in a few days. The way this would work is that it would load all the pages (it really does not matter where the pages are) and apply a few regexes to them. This means we really don't have to merge Nakon's stuff; I can just add that page to the tool. As long as the logs keep the same pattern of one entry per line, a tool is not difficult.

I don't really think logging removals is smart; we never remove entries from the logs anyway. The simplest way is to keep the logs write-only (only new entries) and have a tool list all matches. (I'm writing the tool in a manner where you will be able to put the domain in "plain", as in google.com, and it will find all the relevant entries, even if the log has \bgoogle\.com\b, or some other weirdness.) —— nixeagle 20:23, 6 August 2008 (UTC)Reply
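A sketch of that "plain domain" lookup (the helper is hypothetical; the point is that a query for google.com should also find log lines written as \bgoogle\.com\b):
// sketch: build a tolerant pattern from a bare domain and grep the log lines with it
function findLogEntries( $domain, array $logLines ) {
  // escape the domain, then allow an optional backslash before each dot,
  // so "google.com" also hits "google\.com" as written in the log
  $pattern = '/' . str_replace( '\.', '\\\\?\.', preg_quote( $domain, '/' ) ) . '/i';
  return preg_grep( $pattern, $logLines );
}

// example use
$lines = array(
  '\bgoogle\.com\b # 2008-08-01 blacklisted, see request Foo',
  '\bexample\.org\b # unrelated entry',
);
print_r( findLogEntries( 'google.com', $lines ) );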

lol, by accident i started writing a similar tool 2 hours ago. but mine is only a cli perl script. so far it greps all sbl entries (in meta-blacklist, de-blacklist and de-whitelist) which would match a given url. -- seth 01:35, 7 August 2008 (UTC)Reply
Seth, nixeagle: actually, having a tool that searches all blacklists and logs (i.e. cross-wiki) to see if a link is blacklisted somewhere, and whether there is a log entry for it, would be great. IMHO, it should be 'easy' to write a tool that extracts all regexes from the page and tests whether the url we search for matches any of them (and it could then be incorporated into the {{linksummary}} to easily find it ..). Or is this just what you guys are working on ;-) .. --Dirk Beetstra T C (en: U, T) 09:50, 7 August 2008 (UTC)Reply
beta version. :-) -- seth 14:38, 7 August 2008 (UTC)Reply
WONDERFUL!
one question, can you make it add 'http://' by itself (as we only put the domain in the linksummary, so as to prevent the blacklist from blocking it ..). --Dirk Beetstra T C (en: U, T) 14:48, 7 August 2008 (UTC)Reply
That's about what I was writing. I was putting it in the framework of http://toolserver.org/~eagle/spamArchiveSearch.php where the tool retrieves the section/page and links you directly to where the item was mentioned. For logs I was working on displaying the line entry in the log as one of the results, so you would not even have to view the log page. —— nixeagle 15:11, 7 August 2008 (UTC)Reply
if you want to combine my script with that framework, i can give you the source code. but it is perl-code and it is ugly, about 110 lines. -- seth 17:00, 7 August 2008 (UTC)Reply
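For anyone following along, the core of such a checker is small. A sketch (not seth's actual code, which is Perl; fetching the blacklist page is omitted, and 'http://' is prepended as suggested above):
// sketch: report which blacklist entries would block a bare domain
function matchingEntries( $blacklistText, $domain ) {
  $url = 'http://' . $domain . '/';
  $hits = array();
  foreach ( explode( "\n", $blacklistText ) as $line ) {
    $line = trim( preg_replace( '/#.*$/', '', $line ) ); // strip comments
    if ( $line === '' ) {
      continue;
    }
    // each non-comment line is one regex fragment; the list matches case-insensitively
    $regex = '/' . str_replace( '/', '\/', $line ) . '/i';
    if ( @preg_match( $regex, $url ) === 1 ) {
      $hits[] = $line;
    }
  }
  return $hits;
}

print_r( matchingEntries( "\blluisllach\.pl\b\n\bexample\.com\b", 'lluisllach.pl' ) );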
some pages

Suggestion for pages:

Thanks! --Dirk Beetstra T C (en: U, T) 14:48, 7 August 2008 (UTC)Reply

i had to cope with a bug in en-sbl. but now it seems to work. further suggestions? (the more lists i include, the slower the script will get.)-- seth 16:44, 7 August 2008 (UTC)Reply
I would suggest doing it progressively: first the meta and en blacklists, the rest later (roughly in order of wiki size), similar to what luxo does. --Dirk Beetstra T C (en: U, T) 17:07, 7 August 2008 (UTC)Reply
i used a hash, and those don't care about the order of declaration. now it should be sorted. -- seth 22:11, 7 August 2008 (UTC)Reply

User page advertising

Another "thinking aloud" one!

I guess I come across a commercially orientated user page on Commons about once a day on average. The past week has brought a "Buying cars" page, an "Insurance sales" page and a "Pool supplies" page, as well as blog/software/marketing pages. I do usually run vvv's SUL tool, but quite often there is nothing immediately (the Pool supplies one cropped up on en wp a couple of days after Commons). I know en wp are often reluctant to delete such pages out of hand (which I find incredible).

I think what I am probably saying is: should we open up a section here to allow others to watch/comment/block/delete or whatever across wikis? --Herby talk thyme 09:51, 10 August 2008 (UTC)Reply

I agree, this is a great idea, as I have also noticed spammers like this go cross-wiki to multiple projects (Wikinews/Commons, etc.) Cirt 11:40, 10 August 2008 (UTC)Reply
Agree. Others may be interested in watching only that part of our work - perhaps a transcluded subpage so it may be watched separately?  — Mike.lifeguard | @en.wb 14:03, 10 August 2008 (UTC)Reply
Sounds like the best way to proceed. Cirt 14:09, 10 August 2008 (UTC)Reply
Thanks so far - good to get other views as well but as an idea of the scale I picked these out from the last few days on Commons (all user names) -
Sungate - design advert
Totalpoolwarehouse - obvious & en wp too
Theamazingsystem - two spamvert pages "The Automated Blogging System is a Powerful SEO Technology"
Adventure show - pdf spam file
Firmefront - fr "Banque, Assurance, Gestion Alternative et Private Equity"
The Car Spy - internet car sales
DownIndustries - clothing sales
Serenityweb1 - Nicaragua tourism & en wp
Macminicover - "Dust Cover or Designer Cover for Apple Mac"
I can't instantly find the insurance sales one & I am sure another user produced a page the same as Theamazingsystem. We could do with working out the best way of presenting the info - whether the standard template is needed or whether just an SUL link would allow us a quick check on cross wiki activity?
It would be good to know if the COI bot excludes User: space and whether that may need rethinking?
Cheers --Herby talk thyme 14:35, 10 August 2008 (UTC)Reply
So far as I know, it watches only the mainspace. But Beetstra above said this could be changed.  — Mike.lifeguard | @en.wb 14:40, 10 August 2008 (UTC)Reply
Not sure what else is in the works but I think an SUL link to check activity cross-projects would be sufficient. Anything else would be above and beyond but would also be nice. Cirt 15:06, 10 August 2008 (UTC)Reply
The standard {{ipsummary}} template is pretty good but (I think) lacks the SUL link which for this kind of stuff would be useful (luxo would be a help tho I guess).
The other thing I guess would be to get agreement to lock the blatantly commercial accounts just so that they do not do a "JackPotte" on us I think. I'll maybe point a couple of people to this section. --Herby talk thyme 16:04, 10 August 2008 (UTC)Reply
As it happens I was just trying to lock an account that wasn't SUL yet. I think the concept is sound, these accounts prolly should be locked and hidden. Not sure about mechanics of implementation. ++Lar: t/c 18:39, 10 August 2008 (UTC)Reply
IPs can't have a unified account, so the SUL tool is useless. We have luxo's for that.  — Mike.lifeguard | @en.wb 16:49, 10 August 2008 (UTC)Reply
Yeah - this type really needs an SUL link I think. And we do need to look at the best way we can lock overtly commercial accounts I think. --Herby talk thyme 16:51, 10 August 2008 (UTC)Reply
Today I also saw some spamming by 3 accounts on Commons:Talk:Main Page. I have to say that I do agree with Herby about this - really a nice idea on how to stop at least some of the spamming. --Kanonkas 18:29, 10 August 2008 (UTC)Reply
Good idea, Herby! If you want I can set up a tool similar to SUL:, i.e. list user pages and blocks, for IPs. Of course, other tools are possible as well. --Erwin(85) 19:33, 10 August 2008 (UTC)Reply

Today :) user:Restaurant-lumiere - restaurant spam - [6]. User page advert, series of images all with plenty of information about the restaurant in the "description". --Herby talk thyme 07:07, 11 August 2008 (UTC)Reply

Well .. enough is enough then. The linkwatchers are from now on also parsing the user namespace. --Dirk Beetstra T C (en: U, T) 10:23, 11 August 2008 (UTC)Reply
Bot is adapted for the new task. Had to tweak en:User:XLinkBot for that, but well, do I also have to add the 'Wikipedia:' namespace? --Dirk Beetstra T C (en: U, T) 10:36, 11 August 2008 (UTC)Reply
Personally I think not but others may vary?
+ User talk:Americarx - online pharmacy ads [7], Commons (images & page) & en wp page (& the en wp one had been there a long time). Caught by Kanonkas, so thanks. --Herby talk thyme 10:48, 11 August 2008 (UTC)Reply
Everything that the linkwatchers parse is now getting into the database, and may trigger the XWiki functionality mechanism. We may get more work from this... some more manpower is still necessary (as there are things that I can autocatch which have been excluded thus far ..). --Dirk Beetstra T C (en: U, T) 10:53, 11 August 2008 (UTC)Reply
So are we going to make a transcluded subpage etc or this will get difficult :)
+ User:Tbraustralia - spam page - "TBR Australia is the parent company to TBR Calculators and Australian Student" - [8]. --Herby talk thyme 11:09, 11 August 2008 (UTC)Reply
+ user:Housingyou - www.housingyoumakelaars.nl page & image [9]. --Herby talk thyme 17:57, 11 August 2008 (UTC)Reply

I think others already asked about this, but shouldn't this type of listing of problem cross-project spammers/userpages be moved to a subpage? Cirt 23:06, 12 August 2008 (UTC)Reply

For logging purposes, I put it on this page. I think that will work fine.  — Mike.lifeguard | @en.wb 00:46, 13 August 2008 (UTC)Reply
For me it would just be easier to find and check users with the SUL tool if it were in some unified location on a subpage, but either way is probably okay. Cirt 02:21, 13 August 2008 (UTC)Reply

Our spam filter is now blocking spam URLs in edit summaries

FYI: our spam filter now appears to block spam addresses in edit summaries even if the domain is not in the page text. I just learned this the hard way. It's probably a response to all the shock site spam recently left in edit summaries by vandals; some will crash browsers. --A. B. (talk) 07:45, 20 August 2008 (UTC)Reply

it's not a very new feature: see bugzilla:13599. -- seth 07:52, 20 August 2008 (UTC)Reply