——————————————— Archive, November 2015 ———————————————
Wiki Labels for jawiki
I read your message on jawiki.
I'm very interested in the Wiki Labels project and I'd like to help introduce Wiki Labels to Japanese Wikipedia. (A while ago I made a similar proposal at 2015 Community Wishlist Survey#Suggesting AbuseFilter by machine learning.)
I have completed three tasks today.
But the following two are not done yet:
- Listing the trusted user groups: Where should I report this? The following groups are considered trusted on jawiki.
- Reviewing the auto-generated bad-words list: How can I report the review result? The auto-generated list seems to be completely useless because each entry is only a single character (Kanji). Such problems are hard to avoid, since Japanese is said to be a difficult language for computers to segment into words...
Could you help me introduce Wiki Labels to jawiki?
Sorry if this page is not the right place for this message.
Thank you in advance. --aokomoriuta (talk) 19:43, 11 November 2015 (UTC)
- Hello / こんにちは
- I have processed the information and the work you have done for us so far. I am happy to report that, as a result, we are very close to starting the Wiki Labels campaign on Japanese Wikipedia. :)
- You have already posted the bad words and informal words in the correct location. You are welcome to add more, and even to add regexes.
- In English we delimit words by spaces, which isn't a good strategy for Japanese as far as I can tell. For Japanese, our strategy is to treat each character as a word. If you have a different suggestion, we will do our best to implement it. Indeed, with my very limited understanding of Japanese, I am aware that it is more customary to have pairs or triples of Kanji. The generated list consists of Kanji that statistically appear in reverted edits but not in regular edits. For this we use a TF-IDF approach. Some English curse phrases are made up of two or more words; "God Damn", "Fuck You" and "Fuck Off" are three examples. The words "God", "You" and "Off" would not normally be considered curse words, so our statistical approach would not treat them as such, whereas it would treat "Damn" and "Fuck" as curse words. Likewise, we are trying to identify the Kanji that commonly appear in Japanese curse words, even if they are not used exclusively in curses.
- There are also words that get reverted in articles but not on talk pages. In English this would include words like "hello" or "hahaha". Which Kanji would be informal in this way?
- The idea here is to let the machine learning algorithm decide what to do with these words. Our approach relies on more features than just these word lists.
- -- とある白い猫 chi? 11:45, 14 November 2015 (UTC)
- Hi, I also came from the thread on jawiki. Re word delimiter: Can character-based N-gram tokenization be used for CJK languages, at least as a starting point? There is an open source implementation in Java: NGramTokenizer of Lucene. This approach won't need a word-segmented corpus to learn from and it should be simple enough to re-implement in Python, if necessary.
- A caveat is that the "generated list" would be less easy to read than one generated with more intelligent and complex approaches. Still, character N-grams should be more informative than the one-character tokens currently shown on Meta: many of the character 2-grams, 3-grams and 4-grams of Japanese coincide with words and morphemes, while few characters stand as a word on their own (and thus it can be hard for people to say whether a character is "bad" or not). I believe the same can be said of Korean and Chinese to some extent.
- Perhaps this particular issue on tokenization should go to phab:T111179? whym (talk) 12:31, 14 November 2015 (UTC)
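For illustration, the character N-gram suggestion above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's code and not a port of Lucene's NGramTokenizer: the function names `char_ngrams` and `rank_ngrams` are hypothetical, the sample strings are made up, and the score is a simple smoothed document-frequency ratio standing in for the TF-IDF approach mentioned in the thread.

```python
from collections import Counter


def char_ngrams(text, min_n=2, max_n=4):
    """Return all character N-grams (min_n..max_n) of a text.

    N-grams never span whitespace, so no word-segmented corpus is needed.
    """
    grams = []
    for chunk in text.split():  # split on whitespace only
        for n in range(min_n, max_n + 1):
            for i in range(len(chunk) - n + 1):
                grams.append(chunk[i:i + n])
    return grams


def rank_ngrams(reverted, regular, min_n=2, max_n=4):
    """Rank N-grams by how much more often they occur in reverted edits.

    Hypothetical scoring (an assumption, not the project's TF-IDF):
    the add-one smoothed ratio of document frequencies.
    """
    rev_df, reg_df = Counter(), Counter()
    for text in reverted:
        rev_df.update(set(char_ngrams(text, min_n, max_n)))
    for text in regular:
        reg_df.update(set(char_ngrams(text, min_n, max_n)))

    scores = {}
    for gram in rev_df:
        rev_rate = (rev_df[gram] + 1) / (len(reverted) + 1)
        reg_rate = (reg_df[gram] + 1) / (len(regular) + 1)
        scores[gram] = rev_rate / reg_rate
    # Highest ratio first: most characteristic of reverted edits.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample texts, for demonstration only.
    reverted_edits = ["ばかやろう", "ばかだ"]
    regular_edits = ["東京都の歴史", "歴史だ"]
    for gram, score in rank_ngrams(reverted_edits, regular_edits)[:5]:
        print(gram, round(score, 2))
```

A list ranked this way would contain multi-character candidates rather than single Kanji, which should be easier for reviewers to judge, at the cost of a longer and noisier list.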