The deadline to submit a completed proposal to be considered for funding in this round is 31 March 2014. When you have completed all parts of your proposal, please submit it for review by updating the wikimarkup in your page from status=DRAFT to status=PROPOSED.
Need help finishing your proposal? You can ask us anything in the IEG Questions forum! We're also hosting several IEG Proposal Hangouts this month - the last one is on Saturday, 29 March 2014 at 1700 UTC - please join us to get help in real time!
This message was delivered automatically using global message delivery. Only Wikipedians with IEGs in 'draft' status received this message.
Listing trusted user groups: Where should I report this? The groups below are considered trusted on jawiki
Reviewing the auto-generated bad-words list: How can I report the review result? The auto-generated list seems to be completely useless, because its entries are only one character (kanji) each. This problem is hard to avoid, since Japanese is said to be a difficult language for computers to segment into words...
Could you help me introduce Wiki Labels to jawiki? Sorry if this page is not the right place for this message.
In English we delimit words by spaces, which isn't a good strategy for Japanese as far as I can tell. For Japanese, our strategy is to treat each character as a word. If you have a different suggestion, we will do our best to implement it. Indeed, with my very limited understanding of Japanese, I am aware that it is more customary to have pairs or triples of kanji. The generated list consists of kanji that statistically appear in reverted edits but not in regular edits. For this we use a TF-IDF approach. Some English curse phrases are made up of two or more words: "God Damn", "Fuck You" and "Fuck Off" are three examples. The words "God", "You" and "Off" would not normally be considered curse words, so our statistical approach would not treat them as such, whereas it would treat "Damn" and "Fuck" as curse words. Likewise, we are trying to identify the kanji that commonly appear in Japanese curse words, even if they are not used exclusively in curses.
There are also words that get reverted in articles but not on talk pages. In English this would include words like "hello" or "hahaha". Which kanji would be informal in this way?
The idea here is to let the machine learning algorithm decide what to do with these words. Our approach relies on more features than just these word lists.
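To make the statistical idea above concrete, here is a minimal sketch of a TF-IDF-style score that favours tokens frequent in reverted edits and rare in regular edits. The exact formula used by the tool is not given in this thread, so the weighting, the smoothing, and the function name `score_tokens` are assumptions for illustration only. Each edit is represented as a list of tokens, which could be single characters for Japanese.

```python
import math
from collections import Counter


def score_tokens(reverted_edits, regular_edits):
    """Score tokens: high when common in reverted edits and rare in regular ones.

    Both arguments are lists of edits; each edit is a list of tokens.
    The formula is an illustrative assumption, not the tool's actual one.
    """
    # Term frequency: how often each token occurs across all reverted edits.
    tf = Counter(tok for edit in reverted_edits for tok in edit)
    total = sum(tf.values())

    # Document frequency: in how many regular edits each token appears.
    df = Counter(tok for edit in regular_edits for tok in set(edit))
    n_regular = len(regular_edits)

    scores = {}
    for tok, count in tf.items():
        # Smoothed inverse document frequency over the regular edits.
        idf = math.log((1 + n_regular) / (1 + df[tok])) + 1.0
        scores[tok] = (count / total) * idf
    return scores


reverted = [["damn", "you"], ["damn", "off"]]
regular = [["you", "are", "right"], ["thank", "you"], ["off", "topic"]]
scores = score_tokens(reverted, regular)
for tok in sorted(scores, key=scores.get, reverse=True):
    print(tok, round(scores[tok], 3))
```

In this toy data "damn" ranks highest because it never occurs in the regular edits, while "you" ranks lowest because it is common there. Such scores would only be one input among the other features mentioned above.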
Hi, I also came from the thread on jawiki. Re word delimiter: Can character-based N-gram tokenization be used for CJK languages, at least as a starting point? There is an open source implementation in Java: NGramTokenizer of Lucene. This approach won't need a word-segmented corpus to learn from and it should be simple enough to re-implement in Python, if necessary.
A caveat is that the "generated list" would be harder to read than one generated with more intelligent and complex approaches. Still, character N-grams should be more informative than the one-character tokens currently shown on Meta: many of the character 2-grams, 3-grams and 4-grams of Japanese coincide with words and morphemes, while few characters stand as words by themselves (and thus it can be hard for people to say whether a character is "bad" or not). I believe the same can be said of Korean and Chinese to some extent.
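As a rough illustration of how simple a Python re-implementation could be, here is a minimal character N-gram tokenizer. It is a sketch only and does not reproduce Lucene's NGramTokenizer behaviour exactly; the function name `char_ngrams`, the default range of 2 to 4, and the choice to skip N-grams containing whitespace are my own assumptions.

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max, in order of n.

    N-grams that contain whitespace are skipped (an assumption made here
    so that tokens never span a space or line break).
    """
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if not any(ch.isspace() for ch in gram):
                grams.append(gram)
    return grams


print(char_ngrams("日本語の文", 2, 3))
# ['日本', '本語', '語の', 'の文', '日本語', '本語の', '語の文']
```

The output tokens could then be fed into the same statistics that are currently computed on single characters, with no word-segmented corpus needed.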
Perhaps this particular issue on tokenization should go to phab:T111179? whym (talk) 12:31, 14 November 2015 (UTC)