Grants:IEG/Semi-automatically generate Categories for some small-scale & medium-scale Wikis

status: not selected
summary: This project will help create new categories for small-scale & medium-scale Wikipedias that lack editors. The method is based on NLP patterns such as XY, X in Y, X [VBN] Y, X of Y, X by Y and more. A system will detect interlinked categories X and Y and then create new categories such as X in Y, X of Y, ... in different languages.
target: Vietnamese Wikipedia and three other wikis (any small-scale Wikipedias in Latin-script languages)
strategic priority: To create new categories and arrange them based on the English category classification.
theme: tools
amount: 15500 USD
contact: alphamawikipedia@gmail.com
volunteer: Alphama
created on: 23:50, 8 September 2014 (UTC)
2014 round 2



Project idea

The English category classification has been studied by many scholars over many years. There has been a lot of research on restructuring categories, removing redundant categories (by comparison with WordNet), and so on, and similar work exists for the German and French Wikipedias and some other large wikis. Many smaller Wikipedias, however, have a modest number of categories and only a few active editors (contributors), who cannot create and manage all categories effectively. The solution is therefore to use a bot to create these categories based on NLP patterns and to help the local community organize the category structure based on the English or German taxonomy.

What is your solution?

First, I translate NLP patterns from English into specific languages. For example, the pattern "X in Y" (English) can be translated to "X ở Y" in Vietnamese, and the pattern XY in English corresponds to YX in Vietnamese:

  • In English, we have Category:Information Technology => X = Category:Information, Y = Category:Technology
  • In Vietnamese, if we have Y = Thể loại:Công nghệ (Category:Technology) and X = Thể loại:Thông tin (Category:Information), we can create a new category named Thể loại:Công nghệ thông tin (Category:Information Technology) with two parent categories, Thể loại:Công nghệ and Thể loại:Thông tin.
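
The XY to YX rule in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EN_TO_VI` table and the function name `compose_vi_compound` are hypothetical, and lower-casing the modifier is a guess at the naming style, not an agreed viwiki convention.

```python
# Hypothetical sitelink table: English category name -> Vietnamese category name.
EN_TO_VI = {
    "Information": "Thông tin",
    "Technology": "Công nghệ",
}


def compose_vi_compound(x_en, y_en, sitelinks):
    """Apply the English XY -> Vietnamese YX rule.

    Returns (new_name, parents), or None when either part has no
    Vietnamese counterpart in the sitelink table.
    """
    x_vi = sitelinks.get(x_en)
    y_vi = sitelinks.get(y_en)
    if x_vi is None or y_vi is None:
        return None
    # Head noun (Y) comes first; the modifier (X) follows in lower case
    # (assumed style, to be confirmed by the local community).
    new_name = f"{y_vi} {x_vi[0].lower()}{x_vi[1:]}"
    parents = [y_vi, x_vi]
    return new_name, parents


result = compose_vi_compound("Information", "Technology", EN_TO_VI)
print(result)
```

A name produced this way would still need confirmation by a local editor before the category is created.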

We also have many other patterns for generating new categories, such as X by Y (Category:Birds by country), X in Y (Category:Cities in Vietnam), X of Y, X from Y (Category:People from London), X [VBN] Y (where VBN is the part-of-speech tag for a past participle), ...

Second, I will assign these categories to articles based on the English category classification [1][2]. A fallback solution is to add only common categories to articles, or to follow the community's agreements on category taxonomy.

Project goals

  • To semi-automatically create new categories for small-scale & medium-scale Wikipedias.
  • To create a more fine-grained category taxonomy for small-scale & medium-scale Wikipedias.
  • To reduce editors' manual category-related work.
  • To boost the collaborative quality of small-scale & medium-scale Wikipedias.

Project plan

Activities

1. Collect NLP patterns in different languages

Each language has its own NLP patterns built on the same basic rules. For example, the English category pattern XY becomes YX in Vietnamese. This must therefore be researched in detail for each language. We can form some basic patterns:

  • English: XY, X in Y, X of Y, X from Y, XY by Z (Category:Information technology by region), ...
  • Vietnamese: YX, X ở Y, X của Y, X từ Y, XY theo Z, ...
  • And other languages
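
One possible way to represent such pattern pairs is a small table of regular expressions and output templates. The sketch below covers only the connective patterns listed above; the `LOOKUP` table and the function name `translate_category` are hypothetical, and the Vietnamese values are illustrative rather than agreed category names.

```python
import re

# Pattern pairs from the lists above: (English regex, Vietnamese template).
PATTERNS = [
    (re.compile(r"^(?P<X>.+) in (?P<Y>.+)$"), "{X} ở {Y}"),
    (re.compile(r"^(?P<X>.+) of (?P<Y>.+)$"), "{X} của {Y}"),
    (re.compile(r"^(?P<X>.+) from (?P<Y>.+)$"), "{X} từ {Y}"),
    (re.compile(r"^(?P<X>.+) by (?P<Z>.+)$"), "{X} theo {Z}"),
]

# Hypothetical lookup of already-linked terms (for example from sitelinks).
LOOKUP = {
    "Cities": "Thành phố",
    "Vietnam": "Việt Nam",
    "People": "Người",
    "London": "Luân Đôn",
}


def translate_category(name_en, lookup):
    """Return a candidate Vietnamese name, or None if nothing applies."""
    for regex, template in PATTERNS:
        match = regex.match(name_en)
        if match is None:
            continue
        # Translate every slot; a missing part means this pattern cannot be used.
        parts = {slot: lookup.get(text) for slot, text in match.groupdict().items()}
        if None in parts.values():
            continue
        return template.format(**parts)
    return None


print(translate_category("Cities in Vietnam", LOOKUP))
print(translate_category("People from London", LOOKUP))
```

Names that contain a connective more than once are ambiguous under this scheme, so the output should be treated as a suggestion for human review.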

2. Discuss category naming conventions on each Wikipedia

We have to organize discussions about how each local Wikipedia accepts category names. Many languages distinguish singular from plural. For example, in English, Category:Language is quite different from Category:Languages, but Vietnamese Wikipedia does not have any convention or standard for plurals, so we treat Category:Language as the same as Category:Languages. We focus on some questions:

  • How do editors classify categories on their wiki? Collect their experiences of classifying categories. Which category structure (taxonomy) do they rely on?
  • Do editors of local languages accept the English taxonomy or the German taxonomy? Why?
  • Will editors accept that articles are placed in the deepest categories, or should the number of articles in a category decide which category is used?
  • Should editors accept empty categories that may be important for classification in the future?

3. Collect existing categories that have interlinks & create new categories

We scan a list of categories that have interlinks (Wikidata sitelinks) with other languages (normally English). Based on this list, when we create a new category in one language, we can also create it in other languages automatically or semi-automatically by applying NLP patterns. For each new category, we also check and insert its parent categories as well as the articles that belong to it, based on the English taxonomy. In addition, we check the popular categories on Wikimedia Commons and Wikidata to learn which categories need to be created in local languages.
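
The scanning step can be illustrated with a small in-memory sketch. It assumes the sitelinks have already been downloaded into records shaped like the hypothetical `ITEMS` list below (site id mapped to category title); the function names are also hypothetical, and no real API call is shown.

```python
# Each hypothetical record mimics the sitelinks of one Wikidata item.
ITEMS = [
    {"enwiki": "Category:Technology", "viwiki": "Thể loại:Công nghệ"},
    {"enwiki": "Category:Information", "viwiki": "Thể loại:Thông tin"},
    {"enwiki": "Category:Information technology"},
]


def linked_titles(items, source="enwiki", target="viwiki"):
    """Map source titles to target titles for items linked on both wikis."""
    return {
        item[source]: item[target]
        for item in items
        if source in item and target in item
    }


def missing_categories(items, source="enwiki", target="viwiki"):
    """List source titles whose item has no sitelink on the target wiki."""
    return [
        item[source]
        for item in items
        if source in item and target not in item
    ]


print(linked_titles(ITEMS))
print(missing_categories(ITEMS))
```

The first mapping supplies the known X and Y parts, and the second list gives the candidates to which the NLP patterns could then be applied.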

4. Find RDF triples (article-category triples) based on the English classification

In this step, we use a bot to scan the relationships between categories (those that have interwiki links) and their articles or subcategories in English, and then apply these relationships to other Wikipedias.
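
A minimal sketch of this transfer, assuming each relationship is stored as a (category, "category of", member) tuple and that a sitelink table maps English titles to local titles. The tuple layout, the sample data and the function name `transfer_triples` are assumptions for illustration only.

```python
def transfer_triples(en_triples, sitelinks):
    """Rewrite English triples with local titles where both ends are linked.

    Returns (transferred, skipped); skipped triples lack a local counterpart
    for the category, the member, or both.
    """
    transferred, skipped = [], []
    for category, predicate, member in en_triples:
        local_category = sitelinks.get(category)
        local_member = sitelinks.get(member)
        if local_category and local_member:
            transferred.append((local_category, predicate, local_member))
        else:
            skipped.append((category, predicate, member))
    return transferred, skipped


# Hypothetical sample data.
SITELINKS = {
    "Category:Technology": "Thể loại:Công nghệ",
    "Category:Information technology": "Thể loại:Công nghệ thông tin",
}
EN_TRIPLES = [
    ("Category:Technology", "category of", "Category:Information technology"),
    ("Category:Technology", "category of", "Category:Nanotechnology"),
]

transferred, skipped = transfer_triples(EN_TRIPLES, SITELINKS)
print(transferred)
print(skipped)
```

Skipped triples are kept rather than dropped so that they can be retried once the missing local page exists.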

5. Check categories before adding them

This step supports steps 3 and 4. When inserting categories (AC), which come from RDF triples or are newly created, into articles and other categories, we first format and sort the existing categories of the article (EC), then check the relationship between AC and EC to determine whether AC should be added.
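
The proposal does not spell out which relationship between AC and EC blocks an addition. The sketch below assumes one plausible rule: skip AC when the page already has it, or when AC is an ancestor of a category the page already has (the page is then already classified more specifically). The `PARENTS` map and both function names are hypothetical.

```python
def ancestors(category, parents):
    """Collect all ancestors of a category; tolerates cycles in the graph."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def should_add(candidate, existing, parents):
    """Decide whether candidate (AC) should be added given existing (EC)."""
    existing = sorted(set(existing))  # normalise: de-duplicate and sort
    if candidate in existing:
        return False
    # Assumed rule: do not add a category that is already implied by a
    # more specific existing category.
    return all(candidate not in ancestors(cat, parents) for cat in existing)


# Hypothetical parent map: category -> list of parent categories.
PARENTS = {
    "Cities in Vietnam": ["Cities by country", "Geography of Vietnam"],
    "Cities by country": ["Cities"],
}

print(should_add("Cities", ["Cities in Vietnam"], PARENTS))
print(should_add("Cities in Vietnam", ["Cities"], PARENTS))
```

A different rule, such as the article-count threshold raised in the naming discussion questions, could replace the ancestor test without changing the overall structure.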

6. Receive feedback

Finally, we collect feedback from the local Wikipedias and fix any errors. With this feedback, we can gauge the success of the project and shape its future plan. We will also identify new main and popular categories, which will be proposed as common categories and contributed to Wikidata.

Budget

  • Team members: 2-3 members

No. | Task | Description | Budget
1 | Translate and collect NLP patterns | Small-scale & medium-scale Wikipedias may not have enough editors to help us follow up on the results. If the project receives enough help from the community, this budget may not be needed. | 500 USD per language, 4 languages = 2000 USD
2 | Project management | Write documents and reports and supervise the project | 1500 USD
3 | Support discussions, research solutions, design & build the program | Organize discussions with editors of local wikis. Build tools with these functions: collect data, check category classification & category relationships, translate terms, create new categories, link sitelinks at Wikidata, log tasks | 5000 USD
4 | Data collection and running cost | Run tools to collect data: categories that have interwiki links, RDF triples, categories of articles/categories | 500 USD per language, 4 languages = 2000 USD
5 | Assessments and error fixes | Follow up on every edit, fix errors, and collect reverted edits and comments to make assessments | 500 USD per language, 4 languages = 2000 USD
6 | Maintenance cost | Continue to fix errors for 6 months after the project finishes, to test the stability of the edits and the category structure as editors make new edits during this period | 750 USD per language, 4 languages = 3000 USD
  • Total: 15500 USD
  • Total without maintenance cost: 12500 USD
  • Total if deploying project for Vietnamese Wikipedia only: 8750 USD
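
The three totals can be cross-checked with a short calculation. It assumes that tasks 1, 4, 5 and 6 scale with the number of languages while tasks 2 and 3 are fixed costs; this is one reading of the table that happens to reproduce all three figures, not something the proposal states explicitly. The names below are hypothetical.

```python
# Per-language rates (USD) for tasks 1, 4, 5 and 6.
PER_LANGUAGE = {
    "patterns": 500,
    "data collection": 500,
    "assessment": 500,
    "maintenance": 750,
}
# Fixed costs (USD) for tasks 2 and 3, assumed not to scale with languages.
FIXED = {"project management": 1500, "discussions and tools": 5000}


def total(languages, include_maintenance=True):
    """Total budget in USD for the given number of languages."""
    per_language = sum(
        cost
        for task, cost in PER_LANGUAGE.items()
        if include_maintenance or task != "maintenance"
    )
    return per_language * languages + sum(FIXED.values())


print(total(4))                             # all four languages
print(total(4, include_maintenance=False))  # without maintenance
print(total(1))                             # Vietnamese Wikipedia only
```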

Community engagement

I need the community's help in assessing category names in their languages. A single category can often be referred to by many names (synonyms); local editors will decide which names are appropriate for their language.

Some new editors from outside Wikipedia may join this project to help with translation and grammar assessment.

Sustainability

Through a tool, this project will continue to develop whenever categories are found that exist in English but are missing in other languages. Furthermore, the project helps to align the category structures of all Wikipedias.

This project is not only for Vietnamese Wikipedia but for all languages. If it is deployed successfully, it will be applied to other languages.

I expect this project to keep contributing to all small-scale & medium-scale Latin-script Wikipedias.

Measures of success

Each language should gain at least 5000 new categories (possibly 200000 or more), and a number of RDF triples that depends on the scale of the local Wikipedia should be found and applied to Wikipedia content. The number of new categories may depend on the agreements reached on each local wiki.

Get involved

Participants

I developed the Alphama Category tool [3], which automatically classifies categories for the Vietnamese category structure. AlphamaBot has contributed more than 100000 edits to Vietnamese Wikipedia. The program is written in C# and its output is similar to RDF triples (X-category of-Y). I then use AutoWikiBrowser to insert these triples into Wikipedia content.

Currently, this project has several members. A Vietnamese programmer and I are collecting data. I have contacted some translators from Indonesia and the Philippines.

Community Notification

I notified idwiki, mswiki, thwiki, tlwiki, nowiki and viwiki about my project.


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • This seems doable, but I think it should be done through Wikidata. The general idea is to set up an item and use it to link an "instance of" property and a parent property. See Wikidata Category:Sør-Aurdal for an example. If an item has an "instance of" property (P31) pointing to Wikimedia category page, an automated process can localize the label and use that as the category page title. The local page can then use the sitelinks to identify the same page in enwiki and link the category. The user should be prompted to confirm the automatically created name. Now it gets interesting, because the categories will be extremely fragmented if done this way, so articles should be moved up (or down) the category tree to keep it balanced. If a wiki has too few articles in a category, the article should be categorized in the parent category. If that too has too few articles, the superparent category should be used. If the number of members in a category grows, an attempt should be made to move articles down to a subcategory. This way the category tree will be kept balanced locally even if the ultimate goal is a much more intricate category tree. I can endorse something like that, or something that tries to partly build it. -- Jeblad 11:27, 21 September 2014 (UTC)
Yes, thank you. It is a great idea, but I wonder whether we would find many categories that way. Just the main categories? Alphama (talk) 02:20, 28 September 2014 (UTC)