Grants:IEG/Semi-automatically generate Categories for Vietnamese Wikipedia/Final



Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.

Part 1: The Project

Summary

In a few short sentences, give the main highlights of what happened with your project. Please include a few key outcomes or learnings from your project in bullet points, for readers who may not make it all the way through your report.

The project helped us understand the category taxonomy of Wikipedia projects, including Vietnamese Wikipedia (viwiki) and English Wikipedia (enwiki). We learned that most Vietnamese Wikipedia editors refer to the English Wikipedia category taxonomy when creating new categories and organizing them into a hierarchy. Any category name can be converted to a pattern. The parts of that pattern are translated separately into the target language, the partial results are combined, and the combined candidates are compared against a set of predefined category names to choose the best name.
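The pattern idea in the summary can be illustrated with a minimal sketch. It is not the Alphama Category implementation: the rule table `PATTERN_RULES`, the dictionary `TERM_DICT` and the function `translate_category` are hypothetical names, and the single rule is modeled on the "năm" (year) convention described later in this report.

```python
import re

# Hypothetical term dictionary (English -> Vietnamese). A real tool would fill
# this from Wikidata interlinks or other buffer data.
TERM_DICT = {"science": "Khoa học"}

# Hypothetical pattern rules: a regex for the English name and a template for
# the Vietnamese name. Only one illustrative rule is shown.
PATTERN_RULES = [
    (re.compile(r"^(?P<year>\d{4}) in (?P<topic>.+)$"), "{topic} năm {year}"),
]


def translate_category(name):
    """Return a Vietnamese candidate for an English category name, or None."""
    for regex, template in PATTERN_RULES:
        match = regex.match(name)
        if match is None:
            continue  # the name does not fit this pattern
        parts = match.groupdict()
        # Translate each part separately, then combine through the template.
        topic_vi = TERM_DICT.get(parts["topic"].lower())
        if topic_vi is None:
            return None  # the term is missing from the dictionary
        return template.format(topic=topic_vi, year=parts["year"])
    return None


print(translate_category("1990 in science"))
```

A name that matches no rule, or whose term is not in the dictionary, returns `None` and would need manual translation.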

Methods and activities

What did you do in project?

Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.

  • Community online discussion
  • Ideas from a professor
  • Use AWB to restructure category names
  • Create a translation tool, run translations and improve the translation quality
  • First, I wrote and used my tool (Alphama Category) to scan all Vietnamese categories that have interlinks with English categories at Wikidata. I also organized an on-wiki discussion and gathered a professor's ideas on naming conventions, translations and category standards. I then wrote a short summary of the discussion and updated the standards. Be sure to have a thorough conversation with your community (viwiki) about the category standards, and ask as many members and related stakeholders as you can. I invited more than 1000 editors by message and e-mail, but just over 27 members gave me their opinions.
Restructure Vietnamese category names based on new standards
  • From the restructured list above, I used AWB and AlphamaBot to redirect old categories (the "Vietnamese" column) to new categories (the "Vietnamese fixes" column) and to update the related child and parent categories. I have not yet changed the interlinks at Wikidata because I only recently received the bot flag there and am still learning how to use pywikibot; I will do this task soon. Many editors will ask why you changed the category names, so give the reason for the change in the edit summary box and link to your project. Any opinions they offer may help the project a lot. There are many triples (categories/pages---belongto---categories/pages), so make sure that you restructure all the triples you have.
  • For the translation, you need as much buffer data (temporary data) as you can get; you can obtain it directly from Wikidata, Wikipedia or even DBpedia. In my project, my program collects buffer data (even at run time) because I want to use only the data I really need for the project and filter out the garbage data. Remember that the volume of data greatly affects the duration of the translation processes. Because I interact with Wikidata and Wikipedia through their APIs, be sure you have a good Internet connection and several laptops/machines (or rent some cloud computers) to speed up execution.
Database Diagram
  • To improve the quality of translation, you have to create a function that allows the results, patterns and preferred names of categories to be changed. In addition, the tool should suggest a similar translation case and the matching score (0-1) between that similar case and your translation.
Name analysis of an English category translated to Vietnamese
  • To manage categories, AlphamaCategory also offers a CategoryManagement module.
Category management module
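The advice above to restructure every triple after a rename can be sketched as follows. This is only an illustration under assumptions: the tuple layout, the function names `apply_renames` and `find_stale`, and the sample page and parent category are hypothetical, and the actual edits in the project were made with AWB and AlphamaBot rather than with this code.

```python
from typing import Dict, List, Tuple

# A triple is (child, relation, parent), e.g. a page or category that belongs
# to a parent category. The layout is an assumption made for illustration.
Triple = Tuple[str, str, str]


def apply_renames(triples: List[Triple], renames: Dict[str, str]) -> List[Triple]:
    """Rewrite both ends of every triple using the old -> new name map."""
    return [
        (renames.get(child, child), relation, renames.get(parent, parent))
        for child, relation, parent in triples
    ]


def find_stale(triples: List[Triple], renames: Dict[str, str]) -> List[Triple]:
    """Return triples that still mention an old (renamed) category."""
    return [t for t in triples if t[0] in renames or t[2] in renames]


# Hypothetical sample data; only the renamed category comes from the report.
triples = [
    ("Thể loại:Khoa học 1990", "belongto", "Thể loại:Khoa học"),
    ("Bài viết mẫu", "belongto", "Thể loại:Khoa học 1990"),
]
renames = {"Thể loại:Khoa học 1990": "Thể loại:Khoa học năm 1990"}

updated = apply_renames(triples, renames)
print(updated)
print(find_stale(updated, renames))
```

Checking `find_stale` after the update is a cheap way to confirm that no child or parent link still points at an old name.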

Outcomes and impact

Outcomes

What are the results of your project?

  • Alphama Category tool
Alphama Category 1.0.9

Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.

Progress towards stated goals

Please use the below table to:

  1. List each of your original measures of success (your targets) from your project plan.
  2. List the actual outcome that was achieved.
  3. Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation
Update the naming conventions for viwiki | We updated the page vi:Wikipedia:Thể loại | We put the word "năm" (year) into every category related to years to clarify the meaning of the category. For example, Thể loại:Khoa học năm 1990 is used instead of Thể loại:Khoa học 1990. For categories related to countries, we use the Vietnamese names instead of the English names: Australia --> Úc; Italy, Italia --> Ý. Prepositions such as "of", "in" and "by" are translated differently depending on the case.
Gain the attention of viwiki | 33 members took part in the discussions to improve the category names in viwiki | The category taxonomy had not received much attention from the community. This project was a good opportunity to seek collaboration in viwiki and to raise editors' awareness of the need for good category names. Our discussion can be found at this link [2]
Semi-automatically create new categories for Vietnamese Wikipedia, help reduce the manual tasks related to category classification, and create a more fine-grained category taxonomy for Vietnamese Wikipedia | We integrated the category taxonomy into our bot (AlphamaBot), which runs with AWB (AutoWikiBrowser, a semi-automated tool). Since then, the category taxonomy in viwiki has become very mature compared with many other projects | AlphamaBot works well now: for new articles in viwiki, categories can be translated and added to the articles.
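The year and country conventions listed in the table above could be applied mechanically along these lines. This is a minimal sketch, not the rule set used by AlphamaBot: the function `apply_conventions` and both regular expressions are assumptions, and the country map contains only the examples given in this report.

```python
import re
import unicodedata

# Normalize the literal so the fixed-width lookbehind matches NFC input.
NAM = unicodedata.normalize("NFC", "năm")

# A four-digit year at the end of a name that is not already preceded by "năm ".
YEAR_AT_END = re.compile(rf"(?<!{NAM} )\b(\d{{4}})$")

# Country names taken from the report's examples only.
COUNTRY_VI = {"Australia": "Úc", "Italy": "Ý", "Italia": "Ý"}
COUNTRY_RE = re.compile(r"\b(" + "|".join(COUNTRY_VI) + r")\b")


def apply_conventions(name):
    """Apply the country-name and year conventions to one category name."""
    name = unicodedata.normalize("NFC", name)
    # Replace English country names with their Vietnamese names.
    name = COUNTRY_RE.sub(lambda m: COUNTRY_VI[m.group(1)], name)
    # Insert "năm" before a trailing year unless it is already there.
    return YEAR_AT_END.sub(rf"{NAM} \1", name)


print(apply_conventions("Thể loại:Khoa học 1990"))
```

Names that already follow the convention pass through unchanged, so the function can be run repeatedly over the same list. The preposition rules are not covered here because they depend on the case.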


Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?

We have achieved our goal: this project helped us update our category taxonomy and create many new category names in viwiki.

Global Metrics

We are trying to understand the overall outcomes of the work being funded across all grantees. In addition to the measures of success for your specific program (in above section), please use the table below to let us know how your project contributed to the "Global Metrics." We know that not all projects will have results for each type of metric, so feel free to put "0" as often as necessary.

  1. Next to each metric, list the actual numerical outcome achieved through this project.
  2. Where necessary, explain the context behind your outcome. For example, if you were funded for a research project which resulted in 0 new images, your explanation might be "This project focused solely on participation and articles written/improved, the goal was not to collect images."

For more information and a sample, see Global Metrics.

Metric | Achieved outcome | Explanation
1. Number of active editors involved | 33 members | They took part in the discussion about category names. A few members were interested in proposing ideas on technical issues.
2. Number of new editors | a few members | This is a technical project, so we gained only a few members to work with us on the bot project [3]
3. Number of individuals involved | 33 individuals | They took part in the discussion about category names. A few members were interested in proposing ideas on technical issues.
4. Number of new images/media added to Wikimedia articles/pages | |
5. Number of articles added or improved on Wikimedia projects | about 4000 new categories | The list can be found here [4] [5]
6. Absolute value of bytes added to or deleted from Wikimedia projects | about 4000 new categories | The full list can be found here [6] [7]


Learning question
Did your work increase the motivation of contributors, and how do you know?
  • Yes. Since our project, the number of Vietnamese categories has increased from 137k to 240k [8].

Indicators of impact

Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are mostly likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.

Option A: How did you increase participation in one or more Wikimedia projects?

Option B: How did you improve quality on one or more Wikimedia projects?

Option C: How did you increase the reach (readership) of one or more Wikimedia projects?

  • We choose option B. Our project helps improve the quality of the category taxonomy in viwiki and may be applicable to other small and medium-sized projects that need to create and classify many important categories.

Project resources

Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.

Learning

The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.

We learned a lot about the nature of the category taxonomy and about how to analyze category names in order to translate them into other languages, in our case Vietnamese. We realized that this approach can reduce the manual effort of amateur Wikipedia editors who struggle to create new category names and are unsure which translation is the better option.

What worked well

What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.


The translation works very well for popular categories and for categories that contain popular terms. These terms already appear in the dictionary and in the interlinking structure of Wikidata, so our task is only to combine them and check for the best option using our NLP algorithm, which is based on the similarity of terms.
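The "check for the best option" step can be approximated with a simple string-similarity score in the range 0-1. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the report does not say which similarity measure the project's NLP algorithm uses, and the names `matching_score` and `best_name` as well as the sample list are hypothetical.

```python
from difflib import SequenceMatcher


def matching_score(a, b):
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (equal)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_name(candidate, predefined):
    """Return (name, score) for the predefined name closest to the candidate."""
    if not predefined:
        return None, 0.0
    scored = [(matching_score(candidate, name), name) for name in predefined]
    score, name = max(scored)
    return name, score


# Hypothetical list of predefined category names.
predefined = ["Khoa học năm 1990", "Khoa học năm 1991", "Thể thao năm 1990"]
name, score = best_name("Khoa học năm 1990", predefined)
print(name, score)
```

A low best score is a useful signal that no predefined name is close and that the candidate should be reviewed by an editor.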

What didn’t work

What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.

  • Category names containing terms (words) that do not appear in the dictionary. These are very hard and tricky to translate: many versions found on the Internet were not translated by experts, and even experts sometimes disagree about the translated names.

Other recommendations

If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.

  • We suggest using deep learning and other NLP algorithms to improve the translation quality.

Next steps and opportunities

Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.

Think your project needs renewed funding for another 6 months?




Part 2: The Grant

Finances

Actual spending

Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.

Remaining funds

Do you have any unspent funds from the grant?

Please answer yes or no. If yes, list the amount you did not use and explain why.

  • No

If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:

Documentation

Did you send documentation of all expenses paid with grant funds to grantsadmin(_AT_)wikimedia.org, according to the guidelines here?

Please answer yes or no. If no, include an explanation.

  • No. Since our project was based on technical work and online collaboration, we did not know how to provide the receipts.

Confirmation of project status

Did you comply with the requirements specified by WMF in the grant agreement?

Please answer yes or no.

  • Yes

Is your project completed?

Please answer yes or no.

  • Yes

Grantee reflection

We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!


Expense | Approved amount | Actual funds spent | Difference
Translation and collecting NLP patterns | 1000 USD | 1000 USD | 0 (compared with the mid-term, we received 500 USD more to hire translators to check the correctness of category names)
Designing & building the program | 4500 USD | 4500 USD | 0
Renting machines to run continuously and retrieve data from Wikipedia through its APIs | 500 USD | 500 USD | 0
Supporting discussions and researching solutions (or conducting some surveys) | 1000 USD | 1000 USD | 0