User talk:Lustiger seth/archive001

Crosslink

moved from user talk:wiki_seth. -- seth 01:40, 13 August 2008 (UTC)

Hi Seth - can you provide a link to prove you own the Lustiger_seth account on dewiki please? Just add something to your userpage like "I own Wiki seth on Meta wiki". Just to make sure you're not an impersonator (which I'm sure you're not!) Thank you. Majorly talk 14:15, 9 August 2008 (UTC)

You are right. I guess that this is enough. -- seth 23:19, 11 August 2008 (UTC)

splitting logs

Would you split Spam blacklist/LogPre2008 too? I'm running off to work now, or I would do it myself.  — Mike.lifeguard | @en.wb 09:57, 25 August 2008 (UTC)

I started it, but have got to leave now, too. ;-) -- seth 11:42, 25 August 2008 (UTC)
Shall we really join both lists? The wiki source would be small, but the rendered HTML file would be quite large. -- seth 19:15, 25 August 2008 (UTC)
No, I'd say there is no need to have anything further back than the beginning of 2008 on Spam blacklist/Log. We should make Spam blacklist/Log/2008, Spam blacklist/Log/2007, Spam blacklist/Log/2006 so the log is split up somewhat, I think.  — Mike.lifeguard | @en.wb 19:34, 25 August 2008 (UTC)

Senseless entry removal

The log entry here doesn't work:

[0-9]+\.[-\w\d]+\.info/?[-\w\d]+[0-9]+[-\w\d]*\] # lustiger_seth # removal; senseless, see request

Can you try to figure out what it was trying to block & see if it can be fixed rather than removed? Thanks.  — Mike.lifeguard | talk 18:11, 6 September 2008 (UTC)

Hi!
See Talk:Spam_blacklist#double.2Fwrong_entries (search for "before i start modifying the list").
The regexp never did what it should have done, although it was corrected^Wmodified at least once. As the original request is more than 2 years old, I guess those domains don't need to be blocked any longer; otherwise the SBL would be full of such entries. -- seth 19:34, 6 September 2008 (UTC)
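E.g., a quick test (the sample URL is made up, as nobody seems to know what the entry was supposed to match):

perl -e 'print "123.example.info/page123" =~ m{[0-9]+\.[-\w\d]+\.info/?[-\w\d]+[0-9]+[-\w\d]*\]} ? "match\n" : "no match\n";'

This prints "no match": the trailing \] requires a literal "]", which the URLs checked against the blacklist never contain, so the entry presumably never matched anything at all.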
No, domains remain blacklisted until a request to de-list them is accepted. Please try to figure out what that regex should have been and fix it instead of removing it. Thanks.  — Mike.lifeguard | talk 22:59, 6 September 2008 (UTC)

Log entry

This one doesn't have the oldid specified in the template. Can you try to fix it?  — Mike.lifeguard | talk 11:58, 8 September 2008 (UTC)

Oops, thx! -- seth 12:22, 8 September 2008 (UTC)

Temp sysop

You have the tools again. Congratulations. Alex Pereira falaê 17:26, 9 October 2008 (UTC)

:-) -- seth 10:36, 14 October 2008 (UTC)

non-capturing patterns

Are we sure that using non-capturing patterns helps efficiency? I don't think it does (asked VVV about it at one point) - but would you ask someone to make sure? If it doesn't make a difference, I'd prefer to leave the ?: out; it makes things easier to read. But if it does help, then "Great!"  — Mike.lifeguard | @en.wb 01:42, 14 October 2008 (UTC)

Hi!
Although the extension is written in PHP, I guess its regexp engine is quite similar to Perl's. However, I assume that PHP is no better at this than Perl. ;-)
The PHP manual keeps shtum about most performance-increasing hacks, but the Perl manual says:
WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) [...] (perldoc perlre)
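E.g., a minimal benchmark sketch (the input data is made up) to measure the difference yourself:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# made-up input: many URLs in one string
my $text = join ' ', map { "http://host$_.example.org/path$_" } 1 .. 500;

cmpthese(-2, {
    # capturing parentheses: perl saves $1 for every match
    capturing     => sub { my @m = $text =~ m{(http://\S+)}g },
    # non-capturing group: same grouping, no bookkeeping for $1
    non_capturing => sub { my @m = $text =~ m{(?:http://\S+)}g },
});

The numbers vary with the perl version, but the bookkeeping cost of the capturing variant should show up.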
so "Great!"? :-) -- seth 10:25, 14 October 2008 (UTC)
Sounds good to me.  — Mike.lifeguard | @en.wb 17:28, 14 October 2008 (UTC)

IRC - regex help

Hi Seth, I seem to have some problems with 'catching links' from diffs, and some regex help would be appreciated. Do you have access to IRC, and if so, could you join us in #wikimedia-external-links or in #wikipedia-spam-t ?? Thanks! --Dirk Beetstra T C (en: U, T) 17:25, 3 December 2008 (UTC)

I'm there now. -- seth 18:17, 3 December 2008 (UTC)
...was... ;-) perhaps it's better to e-mail me. -- seth 19:34, 3 December 2008 (UTC)
Hmm .. we seem to have different times online. It is nothing urgent, but I'll try and explain the problem.
The linkwatcher bots (Perl scripts) that I am running retrieve every diff, extract the added and the removed text from it, and strip the tags. That all goes fine. They then fill arrays using regexes for the patterns we want them to catch: one array for the removed part, one for the added part. The 'catches' that are in the added part but not in the removed part are stored in a database, evaluated (counts etc.) and reported to IRC (see #wikipedia-en-spam and #cvn-sw-spam). The special cases go to other channels (#wikipedia-spam-t for English alerts, some high-level alerts to #wikimedia-alerts, and bot-evaluated spam cases to #wikimedia-external-links) and to reports (via COIBot, another bot I run).
At this moment I have 4 regexes catching things from the diffs:
(?:http:\/\/)?[^\s\]\[\{\}\\\|^~`<>]+?@\w+(?!\.htm)(?:\.\w+){1,3}
(?<![\w\d-])(?<!isbn[\s])\d{3,5}[\s-]\d{2,3}[\s-]\d{4}(?![\d\w-])
ftp://[^\s\]\[\{\}\\\|^~`<>]+
http://[^\s\]\[\{\}\\\|^~`<>]+
The last two can be combined, and I am thinking about adding more. The problem lies in the first two. For example, applying the second regex to this diff gives the following results:
(the numbers between the brackets are counts for the specific catches). It should catch 646-227-4900, which looks like a telephone number / social security number. But the three other numbers come from the string "+1 415 839 6885", which should be caught (or at least its 415 839 6885 part) by the second regex. I am not sure where the problem is, but I am afraid it is in the brackets around the look-behind/look-ahead. In certain cases they seem to result in 'secondary' catches (the regex is applied as "@array = $added =~ m/$regex/sig;").
Another problem (for which I coded a workaround) is that links like 'http://www.somewhere.com/someone@somehow.sometime' are caught by both the first and the fourth rule in the list above (the one for e-mail addresses and the one for http links). The workaround is now that after each regex I check whether a previous regex already caught the same string. But it would be nice to have it work a bit more strictly.
I wondered if you would be able to help me with this. I can adapt things in some of the channels at runtime (the rules are in a database as well). Thanks already! --Dirk Beetstra T C (en: U, T) 10:45, 4 December 2008 (UTC)
Adding: I am generally online during work time (from about 9-10 in the morning until about 5-6 in the afternoon; if I guess your timezone correctly, you should add one hour to those times; I am in Cardiff, Wales, UK), and some time around that (but irregularly). We may see each other there. Extra info: I can set up the bots (BigWikiLW2) to send the specific 'spam' to a channel of its own as well (see #wikimedia-external-links). If you are interested, tell me where and I will set it up. --Dirk Beetstra T C (en: U, T) 11:01, 4 December 2008 (UTC)
Hi!
1. (second regexp)
Your second regexp looks good; I don't see any assertion problem. The parentheses belong to look-ahead and look-behind assertions, and there is no capturing, if that is what you were worrying about.
You strip the tags from the added text of the diff, and you said "that all goes fine". But what is the result for, e.g., the given diff? What is done with {{nowrap|1=<span style="text-align:left">+1 415 839 6885</span>}}? Will it become just "+1 415 839 6885"?
In that case you can see that
perl -e "$_='+1 415 839 6885'; my @array = $_ =~ /(?<![\w\d-])(?<!isbn[\s])\d{3,5}[\s-]\d{2,3}[\s-]\d{4}(?![\d\w-])/sig; print $array[0];" (Windows quoting! In bash you have to swap the " and ')
will result in "415 839 6885" and not just "415". So I guess your bug is not inside the regexp. I guess I need more code to be able to analyse this properly.
2. (e-mail and web addresses)
Is the first regexp supposed to catch e-mail addresses only? Why do you start it with (?:http:\/\/)? then? You could use zero-width negative look-behind assertions, or you could build one big regexp from those four regexps, in reverse order, so that http addresses are caught with higher priority than e-mail addresses (see the sketch at the end of this reply).
3. (addition)
Timezone: you're right, here it's UTC+1 now (CET). But I can't give you regular times; sometimes I'm here, sometimes I'm not. :-) Probably during the next 2 weeks I'll be here very often after 11:00 (UTC) and before 20:00 (UTC). -- seth 12:25, 4 December 2008 (UTC)
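To illustrate 2., a rough sketch (simplified and untested against your bots; the sample text is made up):

use strict;
use warnings;

# made-up sample text; in the bots this would be the added text of a diff
my $added = 'see http://www.somewhere.com/someone@somehow.sometime and mail someone@example.org';

my $fb = qr/[^\s\]\[\{\}\\\|^~`<>]/;    # your "forbidden characters" class

# one combined regexp: alternatives are tried from left to right,
# so at any position a link wins over an e-mail address
my $combined = qr{
      (?:ht|f)tp://$fb+                    # http and ftp links, i.e. your rules 3 and 4 merged
    | $fb+?\@\w+(?!\.htm)(?:\.\w+){1,3}    # e-mail addresses, i.e. your rule 1 without the http prefix
}x;

my @catches = $added =~ /($combined)/g;
print "$_\n" for @catches;
# prints the full URL once and the plain e-mail address once;
# the URL is no longer additionally caught as an e-mail address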
Concerning 1.: OK, via chat you found out that the problem was a split by " " after a join by " ". I guess the workaround of splitting and joining by "_" is not good either, because e-mail and web addresses may contain "_". Perhaps it would be better to use "\" (or one or more characters from '[]{}'), because those will never be part of your matches. Btw., why don't you allow "~" in web addresses? "~" may be part of URLs. -- seth 13:23, 4 December 2008 (UTC)
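A tiny sketch (made-up fragments) of the difference:

use strict;
use warnings;

my @fragments = ('see http://example.org/some_page', 'mail me');

# "_" also occurs inside URLs, so the round trip corrupts the data:
my @broken = split /_/,  join('_',  @fragments);   # 3 pieces: the URL is torn apart

# "\" can never be part of a catch (it is in your forbidden class), so this is lossless:
my @intact = split /\\/, join("\\", @fragments);   # 2 pieces: the URL survives

print scalar(@broken), " vs ", scalar(@intact), "\n";   # prints "3 vs 2"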

When blacklisting

Please use the template. The SBL stats tool uses it (& it makes things easy to parse at a glance). Thanks  — Mike.lifeguard | @en.wb 02:23, 12 February 2009 (UTC)

Ok, thx. (I didn't know that.) -- seth 18:10, 12 February 2009 (UTC)

Good catch

Thanks for saving my butt :)  — Mike.lifeguard | @en.wb 02:03, 26 February 2009 (UTC)

;-) -- seth 02:08, 26 February 2009 (UTC)

Adminship

Hello Seth. Please fill in your details at Template:List of administrators. —Anonymous DissidentTalk 10:35, 8 April 2009 (UTC)

Notice of review of adminship

Hello,

In accordance with Meta:Administrators/Removal and because you have made fewer than ten logged actions over the past six months, your adminship is under review at Meta:Administrators/Removal/October 2009. If you would like to retain your adminship, please sign there before 2009-10-9. Kind regards, —Anonymous DissidentTalk 15:10, 2 October 2009 (UTC)

Hi!
Thx for that notice. I signed there. -- seth 07:19, 3 October 2009 (UTC)