Grants:IEG/Semi-automatically generate Categories for Vietnamese Wikipedia

From Meta, a Wikimedia project coordination wiki
status: selected
Semi-automatically generate Categories for Vietnamese Wikipedia
summary: This project will create new categories for small- and medium-scale Wikipedias that lack editors. The method is based on NLP patterns, including XY, X in Y, X [VBN] Y, X of Y, X by Y, etc. A tool will detect the interlinks of X and Y and then create new categories such as X in Y, X of Y, etc. in Vietnamese.
target: Vietnamese Wikipedia
strategic priority: to create new categories and categorize in a way similar to the English category classification.
theme: tools
amount: 7000 USD
grantee: Alphama
contact: alphamawikipedia@gmail.com
volunteer: Alphama

Project idea

The English category classification is a research subject for many scholars, with numerous papers on restructuring categories, removing redundant categories, analyzing the category structure [1], extracting semantic relations [2], etc. Category management cannot be carried out effectively in many small-scale Wikipedias because the number of active editors is limited. Therefore, this project aims to create new categories semi-automatically, using a method based on NLP patterns [3][4], the English taxonomy, and Vietnamese category conventions [5].

What is your solution?

Firstly, we translate NLP patterns from English [3][6] to Vietnamese [5][7]:

| Pattern (English) | Pattern (Vietnamese) | English example | Vietnamese example |
| --- | --- | --- | --- |
| X in Y | X ở Y | Category:Cities in France, X = Category:Cities, Y = Category:France | Thể loại:Các thành phố ở Pháp (currently Thể loại:Thành phố Pháp, whose implicit meaning is "French city", not "cities in France"), X = Thể loại:Thành phố or Thể loại:Các thành phố*, Y = Thể loại:Pháp |
| XY | YX | Category:Information Technology, X = Category:Information, Y = Category:Technology | Thể loại:Công nghệ thông tin, Y = Thể loại:Công nghệ, X = Thể loại:Thông tin |
| X by Y | X theo Y | Category:Birds by country, X = Category:Birds, Y = Category:Countries | Thể loại:Chim theo quốc gia, X = Thể loại:Chim or Thể loại:Các con/loài chim* (main category Thể loại:Lớp Chim), Y = Thể loại:Quốc gia or Thể loại:Các quốc gia* |
| X of Y | X của Y | Category:Provinces of Panama, X = Category:Provinces, Y = Category:Panama | Thể loại:Các tỉnh của Panama or Thể loại:Tỉnh của Panama*, X = Thể loại:Tỉnh, Y = Thể loại:Panama |
| X from Y | X từ Y | Category:People from Madrid, X = Category:People, Y = Category:Madrid | Thể loại:Người từ Madrid, X = Thể loại:Người, Y = Thể loại:Madrid |
| X [VBN] Y | X [VBN] Y | Category:Songs written by Rollo Armstrong, X = Category:Songs, Y = Category:Rollo Armstrong | Thể loại:Bài hát viết bởi Rollo Armstrong, X = Thể loại:Bài hát or Thể loại:Các bài hát*, Y = Thể loại:Rollo Armstrong (viết = write, bởi = by) |
| X in Y (X=yyyy) | Y năm X (X=yyyy) | Category:2014 in Japan, X = Category:2014, Y = Category:Japan | Thể loại:Nhật Bản năm 2014, X = Thể loại:2014, Y = Thể loại:Nhật Bản |
| X(s) in Y (X=yyyy) | Y X (X=yyyy) | Category:2010s in Vietnam, X = Category:2010s, Y = Category:Vietnam | Thể loại:Việt Nam thập niên 2010, X = Thể loại:Thập niên 2010, Y = Thể loại:Việt Nam |

...
* See the naming convention discussions section for plural cases.
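
The pattern table above could be applied mechanically along the following lines. This is only a minimal sketch, not the Alphama tool itself: the names `PATTERNS`, `LEXICON` and `translate_category` are hypothetical, the lexicon stands in for translations that would come from the interlinks of X and Y, and the plural forms chosen here are still subject to the community discussion. The XY and X [VBN] Y patterns are left out because they need part-of-speech information.

```python
import re

# Hypothetical pattern list: (English regex, Vietnamese template).
# Order matters: the year and decade patterns are tried before the generic "X in Y".
PATTERNS = [
    (re.compile(r"^(?P<X>\d{4})s in (?P<Y>.+)$"), "{Y} thập niên {X}"),
    (re.compile(r"^(?P<X>\d{4}) in (?P<Y>.+)$"), "{Y} năm {X}"),
    (re.compile(r"^(?P<X>.+?) in (?P<Y>.+)$"), "{X} ở {Y}"),
    (re.compile(r"^(?P<X>.+?) by (?P<Y>.+)$"), "{X} theo {Y}"),
    (re.compile(r"^(?P<X>.+?) of (?P<Y>.+)$"), "{X} của {Y}"),
    (re.compile(r"^(?P<X>.+?) from (?P<Y>.+)$"), "{X} từ {Y}"),
]

# Illustrative lexicon; in practice these pairs would be read from interlinks.
LEXICON = {
    "Cities": "Các thành phố", "France": "Pháp",
    "Birds": "Chim", "country": "quốc gia",
    "Provinces": "Các tỉnh", "Panama": "Panama",
    "People": "Người", "Madrid": "Madrid",
    "Japan": "Nhật Bản", "Vietnam": "Việt Nam",
}


def translate_category(name, lexicon):
    """Return a Vietnamese category name for an English one, or None."""
    for regex, template in PATTERNS:
        match = regex.match(name)
        if not match:
            continue
        x, y = match.group("X"), match.group("Y")
        # Years are copied unchanged; everything else needs a known translation.
        x_vi = x if x.isdigit() else lexicon.get(x)
        y_vi = lexicon.get(y)
        if x_vi is None or y_vi is None:
            continue  # a part has no interlink, so this pattern cannot be used
        return template.format(X=x_vi, Y=y_vi)
    return None
```

With this lexicon, `translate_category("Cities in France", LEXICON)` gives `Các thành phố ở Pháp`, while a name such as `Songs written by Rollo Armstrong` returns `None` because neither part has a translation in the lexicon.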

Secondly, we will only create new categories and assign them to articles according to the English category taxonomy. A new category must have an interwiki link to its English counterpart and must contain at least one parent category and one item. Whether empty categories are created depends on local community agreement.
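
The creation rule in this paragraph can be written as a small check. The sketch below is an assumption about how such a rule might be encoded: `CandidateCategory`, `may_create` and the `allow_empty` flag are hypothetical names, with `allow_empty` standing in for whatever the local community decides about empty categories.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CandidateCategory:
    """A Vietnamese category that the tool is considering creating (hypothetical model)."""
    vi_name: str
    en_sitelink: Optional[str] = None                  # linked English category, if any
    parents: List[str] = field(default_factory=list)   # Vietnamese parent categories
    members: List[str] = field(default_factory=list)   # articles or subcategories


def may_create(cat, allow_empty=False):
    """Apply the rule: English interwiki link, at least one parent, at least one item."""
    if not cat.en_sitelink:
        return False
    if not cat.parents:
        return False
    # Empty categories are only allowed if the community has agreed to them.
    return bool(cat.members) or allow_empty
```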

Project goals

  • To semi-automatically create new categories for Vietnamese Wikipedia.
  • To create a more fine-grained category taxonomy for Vietnamese Wikipedia.
  • To reduce the manual work involved in category classification.
  • To boost the collaborative quality of Vietnamese Wikipedia.

Project plan

Activities

1. Collect Vietnamese NLP patterns

Each language has its own NLP patterns, so we will collect the Vietnamese ones in this step. We will research the English patterns and translate them into Vietnamese ourselves, then submit the resulting patterns for community discussion.

2. Naming convention discussions

We will organize discussions on how Vietnamese Wikipedia [8] accepts category names, and collect the ideas and agreements of Vietnamese editors in order to standardize the naming convention.

English category names can distinguish singular from plural. Vietnamese does not mark number (singular vs. plural) explicitly. [9] The plural markers những and các can be placed before a noun to make it plural. However, this makes the category name more verbose, and in some cases the meaning does not change. For example, English has two categories, Category:Language and Category:Languages. Vietnamese editors could not clearly explain the difference between the two, so they created a single category, Thể loại:Ngôn ngữ (ngôn ngữ = language, Category:Language), corresponding to both. We could create Thể loại:Các ngôn ngữ (các ngôn ngữ = languages), but this name is longer. If we prefer short category names, we may choose not to use plural markers at all.

Plural marking does not always produce a proper plural name. For example, Category:Birds was translated to Thể loại:Chim (chim = bird). We do not use Thể loại:Các chim, Thể loại:Những chim or Thể loại:Các con chim (các con chim = birds) because their meaning is fuzzy or the category name is verbose.

Some questions which will appear in our discussions:

  • How do editors classify categories in their wiki? Which category structure (taxonomy) do they rely on? Share your experiences or methods for classifying categories.
  • Will the Vietnamese community accept the English taxonomy? Why or why not?
  • Should articles be placed in the deepest available categories, or should the number of articles in each category decide which category is used?
  • Should editors accept empty categories that may be important for classification in the future?

3. Collect interlingual categories, modify category names and create new categories

We scan and collect all categories that have interlanguage links (Wikidata sitelinks) to English Wikipedia. Category names are then modified according to the community agreements from step 2. For each new category, we check its parent categories and its articles against the English taxonomy. A bot scans the relationships between categories (those with interwiki links) and their articles (and subcategories) in English, then applies these relationships to Vietnamese Wikipedia (and to other Wikipedias in future).
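
Applying English relationships to Vietnamese Wikipedia amounts to mapping both ends of each relation through the sitelinks. The following is a rough sketch over in-memory data, assuming the sitelinks and the English relations have already been collected; `project_relations` and the data layout are hypothetical, and no real bot or Wikidata API call is shown.

```python
def project_relations(en_relations, sitelinks):
    """Map (member, category) pairs from English onto Vietnamese titles.

    en_relations: iterable of (English member title, English category title)
    sitelinks:    dict from English title to Vietnamese title
    Pairs with a missing sitelink on either side are skipped.
    """
    projected = []
    for member, category in en_relations:
        vi_member = sitelinks.get(member)
        vi_category = sitelinks.get(category)
        if vi_member and vi_category:
            projected.append((vi_member, vi_category))
    return projected


# Illustrative data only.
sitelinks = {
    "Category:France": "Thể loại:Pháp",
    "Category:Cities in France": "Thể loại:Các thành phố ở Pháp",
    "Paris": "Paris",
}
en_relations = [
    ("Category:Cities in France", "Category:France"),
    ("Paris", "Category:Cities in France"),
    ("Lyon", "Category:Cities in France"),  # no sitelink in this sample, so skipped
]
vi_relations = project_relations(en_relations, sitelinks)
```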

4. Check categories before adding

When inserting Added Categories (AC), which come from RDF triples or are newly created, into articles and other categories, we first format and sort the Existing Categories (EC) of each article, then check the relationship between AC and EC (see en:Wikipedia:Categorization) to determine whether AC should be added.
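
One possible reading of this check is sketched below: an AC is skipped if the page already has it, or if the page already sits in a more specific category that descends from it. The exact rule is an assumption on our part, and `ancestors`, `should_add` and the `parents_of` mapping are hypothetical names, not part of the existing tool.

```python
def ancestors(category, parents_of):
    """Return every category reachable upward from `category` (cycle-safe)."""
    seen = set()
    stack = list(parents_of.get(category, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents_of.get(current, ()))
    return seen


def should_add(ac, existing, parents_of):
    """Decide whether the Added Category `ac` belongs on a page with `existing` categories."""
    ec = sorted(set(existing))  # format and sort the Existing Categories
    if ac in ec:
        return False  # already present
    for category in ec:
        if ac in ancestors(category, parents_of):
            return False  # the page is already in a more specific subcategory
    return True


# Illustrative taxonomy: child category -> list of parent categories.
parents_of = {
    "Thể loại:Các thành phố ở Pháp": ["Thể loại:Pháp"],
    "Thể loại:Pháp": ["Thể loại:Quốc gia"],
}
```

Here `should_add("Thể loại:Pháp", ["Thể loại:Các thành phố ở Pháp"], parents_of)` is `False`, because the page is already in a subcategory of the candidate, while adding the more specific category to a page that only has `Thể loại:Pháp` is allowed.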

5. Receive feedback

Finally, we collect feedback from local Wikipedians and fix any errors that occur. This lets us assess the project's success and plan its future direction.

Budget

  • Team members: 1-2 members
| No. | Task | Description | Budget |
| --- | --- | --- | --- |
| 1 | Translate and collect NLP patterns | Collect and research NLP patterns. | 500 USD |
| 2 | Project management | Write documents and reports, and supervise the project. | 1000 USD |
| 3 | Support discussions, research solutions (or run surveys) | Organize online discussions with editors of local wikis. Gather ideas from scholars. | 1000 USD |
| 4 | Design and build the program | Build a tool with these functions: collect data, modify category names, check category classification and category relationships, translate terms, create new categories, link sitelinks at Wikidata, log tasks. | 2500 USD |
| 5 | Data collection and running cost | Run the tool to collect data: categories that have interwiki links, RDF triples, and the categories of articles and categories. Then run the tool to create new categories and enrich the category classification. | 1000 USD |
| 6 | Assessments and error fixes | Follow up on every edit, fix errors, and collect reverted edits and comments for assessment. Sometimes we need to work with editors on special categories. This step is mostly manual. | 1000 USD |
  • Total: 7000 USD

Community engagement

We need the help of the Vietnamese community in assessing category names. Many names (synonyms) can refer to the same category, and local editors will decide which names are appropriate for their language. Some new editors from outside Wikipedia may join this project to help with translation and grammar assessment.

Sustainability

The project will continue through the tool: whenever a category exists in English but is missing in another language, the tool can be used to create it. The project also helps align the category structures of all Wikipedias.

This project is intended not only for Vietnamese Wikipedia but for all languages. If it is deployed successfully, it will be applied to other languages. We expect it to keep contributing to all small- and medium-scale Latin-script Wikipedias.

Measures of success

This project will update the naming conventions, which will then serve as the standard for revising all existing category names.

The Category Tool:

  • will be usable by anyone with proper access rights to create and classify new categories for articles
  • will create at least 5000 new categories
  • will update at least 5000 RDF triples (X-belongs to-Y, X = article/category, Y = category)

Get involved

Participants

We developed the Alphama Category tool [1], which automatically classifies categories for the Vietnamese category structure. AlphamaBot has contributed more than 500000 edits to Vietnamese Wikipedia. The program is written in C# and its output is similar to RDF triples (X-category of-Y). We then use AutoWikiBrowser to insert these triples into Wikipedia content. Recently, I developed Alphama Editor [2], which includes Alphama Category and other functions.

Community Notification

I notified some Vietnamese Wikipedians (on their talk pages), who will join and give me suggestions about this project.


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Currently the category tree in Vietnamese Wikipedia is a mess and no one has enough time and patience to restructure it. With this tool we can be sure that the category tree always stays up to date with English Wikipedia. Very helpful. Na Tra (talk) 19:43, 27 September 2015 (UTC)
  • It sounds like an awesome plan to me! Nguyentrongphu (talk) 00:02, 28 September 2015 (UTC)
  • I am using the editor module and the category converter of Alphama and it does help me a lot in translating categories from English to Vietnamese. Love to see this project can enhance even further. Tuanminh01 (talk) 04:43, 28 September 2015 (UTC)
  • It can help editors shorten the time sorting the articles by categories and can be developed for other wikis. AlleinStein (talk) 04:45, 30 September 2015 (UTC)
  • I'm a big fan of projects like this that would help to level the playing field between larger and smaller Wikipedias. Rationalizing the category structure has always been a tedious and thankless job, particularly when there are redundant or conflicting categories somewhere in the hierarchy. Automating this work would go a long way toward making newly translated articles more discoverable. Minh Nguyễn 💬 07:37, 1 October 2015 (UTC)
  • It can help editors shorten the time for the classification of articles in many types. I find this is an indispensable project, specially for other wikis. Zajzajmkhvtc90 (talk) 15:38, 3 October 2015 (UTC)
  • I support this project, it helps vi.wikipedia community to create and manage all categories effectively. Earthandmoon (talk) 10:08, 4 October 2015 (UTC)
  • If it can help community, then why not? --minhhuy (talk) 13:59, 5 October 2015 (UTC)
  • category trees are always delicate, I suppose small wiki haven't enough time and users to perform the right maintenance. I think we don't even have the right wikimetric tools to describe the robustness of the cat trees. Let's see if the tool works and how to export the idea to other platforms.--Alexmar983 (talk) 00:48, 6 October 2015 (UTC)
  • This is an interesting project that could substantially improve the quality of categories on small wikis and substantially reduce the time investment required to do so. It's clear that Alphama has substantial buy-in from Vietnamese Wikipedians -- which is the only substantial blocker I see to success beyond some good ol' hard work. --EpochFail (talk) 18:36, 1 December 2015 (UTC)

References

  1. Zesch, T., & Gurevych, I. (2007, April). Analysis of the Wikipedia category graph for NLP applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007) (pp. 1-8).
  2. Chernov, S., Iofciu, T., Nejdl, W., & Zhou, X. (2006). Extracting Semantic Relationships between Wikipedia Categories. SemWiki, 206.
  3. Nastase, V., & Strube, M. (2008, July). Decoding Wikipedia Categories for Knowledge Acquisition. In AAAI (Vol. 8, pp. 1219-1224).
  4. Nastase, V., Strube, M., Börschinger, B., Zirn, C., & Elghafari, A. (2010, May). WikiNet: A Very Large Scale Multi-Lingual Concept Network. In LREC.
  5. Wikipedia:Thể loại
  6. Xu, L., Takeda, H., Hamasaki, M., & Wu, H. (2010). Typing software articles with Wikipedia category structure.
  7. Depends on community discussions
  8. Talk page about Vietnamese Category
  9. Ho-Dac Tuc. Vietnamese-English Bilingualism: Patterns of Code-switching. Psychology Press, 2003. ISBN 0700713220. pp. 57-58.