User talk:とある白い猫

A Certain White Cat
Bilinen Bir Beyaz Kedi


Assume good faith!

Today is Friday, 27 November, 2015, and the current time is 02:54 (UTC/GMT).
There are currently 51,710 articles and 2,522 files on Meta.


Hello, welcome to my talk page. You are welcome to post comments below. Anything you put here will likely be archived and available for public view. Please be polite and civil.

To post a new topic please use this link or the '+' between "edit this page" and "history".



User posts

March 31 is the deadline for finishing your Individual Engagement Grant proposal!

Hello とある白い猫! Thank you for drafting an Individual Engagement Grant proposal.

The deadline to submit a completed proposal to be considered for funding in this round is 31 March 2014. When you have completed all parts of your proposal, please submit it for review by updating the wikimarkup in your page from status=DRAFT to status=PROPOSED.

Need help finishing your proposal? You can ask us anything in the IEG Questions forum! We're also hosting several IEG Proposal Hangouts this month - the last one is on Saturday, 29 March 2014 at 1700 UTC - please join us to get help in real time!

This message was delivered automatically using global message delivery. Only Wikipedians with IEGs in 'draft' status received this message.


Wiki Labels for jawiki

とある白い猫さん、こんにちは!(Hello, とある白い猫!)

I read your message on jawiki. I'm very interested in the Wiki Labels project and I'd like to help introduce Wiki Labels to the Japanese Wikipedia. (A while ago I made a similar proposal at 2015 Community Wishlist Survey#Suggesting AbuseFilter by machine learning.)

I completed the following three tasks today.

But the following two are not done yet.

  • Listing trusted user groups: Where should I report this? The following groups are considered trusted on jawiki:
    • abusefilter
    • bureaucrat
    • checkuser
    • eliminator
    • interface-editor
    • oversight
    • rollbacker
    • sysop
  • Reviewing the auto-generated bad-words list: How can I report the review result? The auto-generated list seems to be completely useless because every entry is only a single character (kanji). This problem is hard to avoid, since Japanese is said to be a difficult language for computers to segment into words.

Could you help me introduce Wiki Labels to jawiki? Sorry if this page is not the right place for this message.

よろしくお願いします。(Thanks.)--aokomoriuta (talk) 19:43, 11 November 2015 (UTC)

Hello / こんにちは
I have processed the information and the work you have done for us so far. I am happy to report that, as a consequence, we are very close to starting the Wiki Labels campaign on the Japanese Wikipedia. :)
You have already posted the bad words and informal words in the correct location. You are welcome to add more, and even to add regexes.
In English we delimit words by spaces, which isn't a good strategy for Japanese as far as I can tell. For Japanese, our strategy is to treat each character as a word. If you have a different suggestion, we will do our best to implement it. Indeed, even with my very limited understanding of Japanese, I am aware that it is more customary to have pairs or triples of kanji.
The generated list consists of kanji that statistically appear in reverted edits but not in regular edits. For this we use a TF-IDF approach. Some English curse phrases are made up of two or more words; "God Damn", "Fuck You" and "Fuck Off" would be three examples. The words "God", "You" and "Off" would not normally be considered curse words, so our statistical approach would not treat them as such, whereas it would treat "Damn" and "Fuck" as curse words. Likewise, we are trying to identify the kanji that commonly appear in Japanese curse words, even if they are not used exclusively in curses.
There are also words that are reverted in articles but not on talk pages. In English this would include words like "hello" or "hahaha". Which kanji would be informal like this?
The idea here is to let the machine learning algorithm decide what to do with these words. Our approach relies on more features than just these word lists.
-- とある白い猫 chi? 11:45, 14 November 2015 (UTC)
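For readers following the thread, here is a minimal sketch of the statistical idea described in the reply above: treat each character as a token and score it by how often it occurs in reverted edits, weighted down when it also occurs in regular edits. This is only an illustration under stated assumptions, not the project's actual code. The function names (`char_tokens`, `score_tokens`), the smoothed IDF formula, and the toy inputs are all hypothetical.

```python
import math
from collections import Counter


def char_tokens(text):
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def score_tokens(reverted_edits, regular_edits, tokenize=char_tokens):
    """Score tokens that are frequent in reverted edits but rare in regular ones.

    tf  = share of all reverted-edit tokens that are this token
    idf = log((1 + N) / (1 + df)) + 1, where df is the number of regular
          edits containing the token and N is the number of regular edits
          (a smoothed variant chosen for this sketch; an assumption)
    """
    tf_counts = Counter(tok for edit in reverted_edits for tok in tokenize(edit))
    total = sum(tf_counts.values())
    # Document frequency: count each token at most once per regular edit.
    df = Counter(tok for edit in regular_edits for tok in set(tokenize(edit)))
    n = len(regular_edits)
    return {
        tok: (count / total) * (math.log((1 + n) / (1 + df[tok])) + 1)
        for tok, count in tf_counts.items()
    }


# Toy placeholder data: "x" appears only in reverted edits, "a" in both.
scores = score_tokens(reverted_edits=["xa", "xb"], regular_edits=["ab", "ac"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the tokenizer is passed in as a parameter, a different tokenization strategy can be tried without changing the scoring step.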
Hi, I also came from the thread on jawiki. Re word delimiter: Can character-based N-gram tokenization be used for CJK languages, at least as a starting point? There is an open source implementation in Java: NGramTokenizer of Lucene. This approach won't need a word-segmented corpus to learn from and it should be simple enough to re-implement in Python, if necessary.
A caveat is that the "generated list" would be less easy to read than one generated with more intelligent and complex approaches. Still, character N-grams should be more informative than the one-character tokens currently shown on Meta: many of the character 2-grams, 3-grams and 4-grams of Japanese coincide with words and morphemes, while few characters stand as a word by themselves (and thus it can be hard for people to say whether a character is "bad" or not). I believe the same can be said of Korean and Chinese to some extent.
Perhaps this particular issue on tokenization should go to phab:T111179? whym (talk) 12:31, 14 November 2015 (UTC)
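To make the character N-gram suggestion above concrete, here is a rough Python sketch. It is not a port of Lucene's NGramTokenizer and makes no claim to match its output order or options; the function name `char_ngrams`, its parameters, and the choice to split on whitespace first are assumptions made for illustration.

```python
def char_ngrams(text, min_n=2, max_n=4):
    """Return character n-grams of length min_n..max_n.

    N-grams are generated within each whitespace-separated chunk, so no
    n-gram spans a space. Shorter n-grams are listed before longer ones.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    grams = []
    for chunk in text.split():
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the chunk.
            for i in range(len(chunk) - n + 1):
                grams.append(chunk[i:i + n])
    return grams


print(char_ngrams("日本語", min_n=2, max_n=2))  # ['日本', '本語']
```

No word-segmented corpus is needed for this step, which is the main appeal of the approach; the trade-off is that many generated N-grams will not correspond to real words.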