Grants:IEG/Semi-automatically generate Categories for Vietnamese Wikipedia/Midpoint
This project is funded by an Individual Engagement Grant
Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.
We observed that Vietnamese Wikipedia (viwiki) had neither a strict naming convention for categories nor a strong category structure. Although viwiki has its own category standard, which can be found at vi:Wikipedia:Thể loại, it does not seem to offer enough classification requirements and cannot thoroughly solve the naming problems.
In our discussion, many editors stated that they manually create categories and add them to articles according to their own understanding. This can make the category tree incorrect. Some articles were placed in parent categories instead of the child categories, or carried both a parent and its child category, which is redundant. What we expect is an explicit category structure, so that readers can find articles easily and editors know exactly how to classify categories. This project aims to create new categories semi-automatically with a method that combines NLP patterns, the English taxonomy and Vietnamese category conventions.
Vietnamese Wikipedia held a discussion about naming conventions for categories. 26 members and 1 professor participated and gave their suggestions. The discussion reached some outcomes that can help viwiki editors define clearly how to create new categories.
Methods and activities
How have you setup your project, and what work has been completed so far?
- Community online discussion
- Ideas from a professor
- Use AWB to restructure category names
- Create a translation tool, run translations, and improve the translation quality
Describe how you've setup your experiment or pilot, sharing your key focuses so far and including links to any background research or past learning that has guided your decisions. List and describe the activities you've undertaken as part of your project to this point.
- First, I wrote and used my tool (Alphama Category) to scan all Vietnamese categories that have interlinks with English categories at Wikidata. I also organized an online discussion on wiki and gathered a professor's ideas on naming conventions, translations and category standards. Then I wrote a short summary of this discussion and updated the standards. Be sure to have a very deep conversation with your community (viwiki) about the category standards, and ask as many members or related stakeholders as you can. I invited more than 1000 editors by message and e-mail, but just over 27 members gave me their opinions.
- From the restructured list above, I used AWB and AlphamaBot (talk · contribs) to redirect old categories (Vietnamese column) to new categories (Vietnamese fixes) and to update the related child and parent categories. I have not changed the interlinks at Wikidata yet, because I only recently received the bot flag at Wikidata and am still learning how to use pywikibot. I will do this task soon. Many editors will ask why you changed the category names, so you have to give the reason for the change in the summary box and link to your project. If they have any opinions, those may help the project a lot. There are many triples (categories/pages---belongto---categories/pages), so make sure that you restructure all the triples you have.
- For the translation, you need as much buffer data (temp data) as you can get; you can take it directly from Wikidata, Wikipedia or even DBpedia. In my project, my program collects buffer data (even at run time), because I want to use only the data I really need for the project and filter out the garbage data. Remember that the volume of data strongly affects the duration of the translation processes. Because I interact with Wikidata and Wikipedia through their APIs, be sure you have a good Internet speed and several laptops/machines (or rent some cloud computers) to speed up the execution.
- To improve the quality of translation, you have to create a function that allows changing the results, patterns and preferred names of categories. Besides, suggest a similar translation case and the matching score (0-1) between this similar case and your translation.
- To manage categories, AlphamaCategory also offers a CategoryManagement module.
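The "similar translation case with a matching score (0-1)" idea from the list above can be sketched as follows. This is only an illustration: the report does not say how the tool computes its score, so the sketch assumes a plain character-level similarity from Python's standard `difflib` module. The function names `matching_score` and `closest_case` and the sample category names are hypothetical.

```python
from difflib import SequenceMatcher


def matching_score(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two category names.

    Assumption: character-level similarity stands in for whatever
    measure the real tool uses.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_case(source: str, known_cases: dict) -> tuple:
    """Find the already-translated English category most similar to `source`.

    `known_cases` maps English category names to their accepted
    Vietnamese names. Returns (english_name, vietnamese_name, score).
    """
    if not known_cases:
        raise ValueError("known_cases must not be empty")
    best = max(known_cases, key=lambda en: matching_score(source, en))
    return best, known_cases[best], matching_score(source, best)


# Hypothetical, made-up sample data for illustration only.
known = {
    "Category:Rivers of France": "Thể loại:Sông Pháp",
    "Category:Cities in Vietnam": "Thể loại:Thành phố Việt Nam",
}
similar_en, similar_vi, score = closest_case("Category:Rivers of Germany", known)
print(similar_en, similar_vi, round(score, 2))
```

An editor reviewing a new translation could then see the closest earlier case and its score side by side, and adjust the result or the preferred name if the score is low.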
What are the results of your project or any experiments you’ve worked on so far?
- New naming conventions
- Collected more than 47k categories that have interwiki links with English categories.
- Wrote a draft report
- Grants:IEG/Semi-automatically generate Categories for Vietnamese Wikipedia/Data
- Translation tool
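As a rough illustration of how categories with interwiki links to English categories can be picked out, the sketch below filters a Wikidata `wbgetentities`-style JSON response for items that carry both an `enwiki` and a `viwiki` sitelink. It is not the code of Alphama Category; the function name `paired_categories`, the item ids and the titles in the sample are hypothetical, and the response is a hand-written dictionary rather than a live API call.

```python
def paired_categories(response: dict,
                      source_site: str = "enwiki",
                      target_site: str = "viwiki") -> list:
    """Return (item_id, source_title, target_title) for every entity
    that has sitelinks on both wikis."""
    pairs = []
    for item_id, entity in response.get("entities", {}).items():
        sitelinks = entity.get("sitelinks", {})
        if source_site in sitelinks and target_site in sitelinks:
            pairs.append((item_id,
                          sitelinks[source_site]["title"],
                          sitelinks[target_site]["title"]))
    return pairs


# Hand-written sample shaped like a wbgetentities response (ids and titles are made up).
sample = {
    "entities": {
        "Q_EXAMPLE_1": {
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Category:Example A"},
                "viwiki": {"site": "viwiki", "title": "Thể loại:Ví dụ A"},
            }
        },
        "Q_EXAMPLE_2": {
            # No viwiki sitelink, so this item is skipped.
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Category:Example B"},
            }
        },
    }
}
print(paired_categories(sample))
```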
Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.
Please take some time to update the table in your project finances page. Check that you’ve listed all approved and actual expenditures as instructed. If there are differences between the planned and actual use of funds, please use the column provided there to explain them.
|Expense||Approved amount||Actual funds spent||Difference|
|Translation and collect NLP patterns||500 USD||500 USD||0|
|Design & building program||2500 USD||2500 USD||0|
|Support discussions, research solutions (or make some surveys)||1000 USD||1000 USD||0|
Then, answer the following question here: Have you spent your funds according to plan so far? Please briefly describe any major changes to budget or expenditures that you anticipate for the second half of your project.
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you are taking enough risks to learn something really interesting! Please use the below sections to describe what is working and what you plan to change for the second half of your project.
What are the challenges
What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.
- The community disagrees about category names that contain prepositions (of, by, in, ...), plurals, past participles (VBN), and so on. These issues are left out of the project's scope for now.
- Category names set from NLP patterns are diverse and lack uniformity. We are narrowing the scope, solving each case in detail, and inheriting from previous category names.
What is working well
What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.
- The attention and agreement of the community in clearly defining category standards and related matters.
- I understand the category structure more deeply and know some useful methods for translating English categories into other languages.
- How to enrich article content by adding categories
- Understand the translation process and combine different techniques to improve the translation quality.
Next steps and opportunities
What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.
- Continue to restructure the category tree and generate new categories for new articles. So far we have translated 468 new categories, and we hope to raise this number to more than 5000 categories along with their classification.
- Finish writing reports
- Receive feedback
We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?