User talk:とある白い猫/Archive/2015/11


Hello this is an Archive. Please do not edit. You are welcome to post comments regarding material here at my user talk page.

Always believe in yourserf and your dreams, you have a wing!


——————————————— Archive, November 2015 ———————————————

Wiki Labels for jawiki

Hello, とある白い猫!

I read your message on jawiki. I'm very interested in the Wiki Labels project and I'd like to help introduce Wiki Labels to Japanese Wikipedia. (A while ago I made a similar proposal in the 2015 Community Wishlist Survey#Suggesting AbuseFilter by machine learning.)

I completed the following three tasks today:

- Translating the interface
- Setting up the landing page on jawiki
- Manually adding entries to the bad/informal words list

But the following two are not done yet:

Listing the trusted user groups: Where should I report this? The following groups are considered trusted on jawiki:
- abusefilter
- bureaucrat
- checkuser
- eliminator
- interface-editor
- oversight
- rollbacker
- sysop
Reviewing the auto-generated bad-words list: How can I report the review result? The auto-generated list seems to be completely useless, because every entry is only a single character (kanji). This problem is hard to avoid, because Japanese is said to be a difficult language for computers to segment into words...

Could you help me to introduce Wiki Labels into jawiki? Sorry if this page is not suited for this message.

Thanks. --aokomoriuta (talk) 19:43, 11 November 2015 (UTC)

Hello / こんにちは

I have processed the information and the work you have done for us so far. As a result, I am happy to report that we are very close to starting the Wiki Labels campaign on Japanese Wikipedia. :)

You already posted the bad words and informal words in the correct location. You are welcome to add more, and even to add regexes.

In English we delimit words by spaces, which is not a good strategy for Japanese as far as I can tell. For Japanese our strategy is to treat each character as a word. If you have a different suggestion we will do our best to implement it. Indeed, with my very limited understanding of Japanese, I am aware that words are more customarily written as pairs or triples of kanji. The generated list consists of kanji that statistically appear in reverted edits but not in regular edits. For this we use a TF-IDF approach. Some English curse phrases are made up of two or more words: "God Damn", "Fuck You" and "Fuck Off" would be three examples. The words "God", "You" and "Off" would not normally be considered curse words, so our statistical approach would not treat them as such, whereas it would treat "Damn" and "Fuck" as curse words. Likewise, we are trying to identify the kanji that commonly appear in Japanese curse words even if they are not used exclusively in curses.
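A minimal sketch of the idea described here, assuming a toy corpus: each character is treated as a token, and a character scores highly when it is frequent in reverted edits but rare in regular edits. The function names `char_tokens` and `tfidf_scores` and the exact weighting formula are illustrative assumptions, not the code Wiki Labels actually uses.

```python
import math
from collections import Counter


def char_tokens(text):
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def tfidf_scores(reverted_edits, regular_edits):
    """Score characters: term frequency in reverted edits, weighted by a
    smoothed inverse document frequency over regular edits.

    The smoothing (adding 1 to both counts) is an assumption made for
    this sketch so that unseen characters do not divide by zero.
    """
    tf = Counter(tok for text in reverted_edits for tok in char_tokens(text))
    total = sum(tf.values())
    # Document frequency: number of regular edits containing the character.
    df = Counter(tok for text in regular_edits for tok in set(char_tokens(text)))
    n_docs = len(regular_edits)
    return {
        tok: (count / total) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }


# Toy data: the letters stand in for kanji.
reverted = ["ab", "ac"]
regular = ["cd", "ce"]
scores = tfidf_scores(reverted, regular)
for tok in sorted(scores, key=scores.get, reverse=True):
    print(tok, round(scores[tok], 3))
```

In this toy data, a character that appears in every regular edit gets a score of zero, which mirrors how "You" or "Off" would not be flagged while "Damn" would.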

There are also words that get reverted in articles but not on talk pages. In English these would include words like "hello" or "hahaha". Which kanji would be informal in this way?

The idea here is to let the machine learning algorithm decide what to do with these words. Our approach relies on more features than just these word lists.

-- とある白い猫 ^chi? 11:45, 14 November 2015 (UTC)

Hi, I also came from the thread on jawiki. Re word delimiter: Can character-based N-gram tokenization be used for CJK languages, at least as a starting point? There is an open source implementation in Java: NGramTokenizer of Lucene. This approach won't need a word-segmented corpus to learn from and it should be simple enough to re-implement in Python, if necessary.

A caveat is that the "generated list" would be less easy to read than one generated with more intelligent and complex approaches. Still, character N-grams should be more informative than the one-character tokens currently shown on Meta: many of the character 2-grams, 3-grams and 4-grams of Japanese coincide with words and morphemes, while few characters stand as a word by themselves (and thus it can be hard for people to say whether a character is "bad" or not). I believe the same can be said of Korean and Chinese to some extent.
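As a rough illustration of the character N-gram tokenization suggested above, a Python version could look like the sketch below. The function name `char_ngrams` and its default range of 2 to 4 characters are assumptions chosen to match the 2-grams, 3-grams and 4-grams mentioned; this is not a port of Lucene's NGramTokenizer and its output order may differ.

```python
def char_ngrams(text, min_n=2, max_n=4):
    """Return character n-grams of length min_n..max_n.

    The text is first split on whitespace so that no n-gram spans a
    space; within each chunk, n-grams are listed shortest first.
    """
    grams = []
    for chunk in text.split():
        for n in range(min_n, max_n + 1):
            grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams


print(char_ngrams("日本語"))  # ['日本', '本語', '日本語']
```

No word-segmented corpus is needed: the n-grams can be counted and scored in the same way as single characters.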

Perhaps this particular issue on tokenization should go to phab:T111179? whym (talk) 12:31, 14 November 2015 (UTC)