Grants talk:IEG/Semi-automatically generate Categories for some small-scale & medium-scale Wikis

Need help finalizing this proposal?

Hi Alphama, thanks for drafting this proposal! I just wanted to let you know that we're hosting a few more IEG proposal help sessions in Google Hangouts and IRC next week, so please join if you'd like to get some extra help or early feedback as you finalize your submission. Once you're ready to submit it for review, please update its status (in your page's Probox markup) from DRAFT to PROPOSED.

I'm also pinging Asaf who may have input or thoughts to offer you along the lines of small-wiki perspectives, and Libcub who is conducting research on English Wikipedia categories and who may also have some thoughts or info to offer based on what he's been learning :) Best wishes, Siko (WMF) (talk) 18:57, 19 September 2014 (UTC)

I did not know how to change the project status. Thank you for the information. Alphama (talk) 14:02, 22 September 2014 (UTC)

Technical details

Hi. Thank you for taking the initiative to write this idea up. I have a few points that I would like your help clarifying:

  1. Would the bot add existing articles to the newly created categories or not?
  2. Would the bot link the created categories to their respective pages on Wikidata?
  3. If the answer is Yes to #1, would the process result in creating empty categories if there are no articles to be added to them?
  4. Would there be any mechanism to merge the new categories into the existing category taxonomy on the local Wiki? Or would it result in a completely new hierarchical set of categories?
  5. Has there been any community discussion or poll to get a general sense of what users think about this idea and what their concerns might be? I understand this would be part of the project, but it would be too late if the community turns out not to support this idea after the project is funded.

Thanks!--HaithamS (WMF) (talk) 14:39, 20 September 2014 (UTC)

Hello HaithamS,

Sorry that my English is not perfect. For more than 2 years, I have used my bot [1] to classify articles into Vietnamese categories based on the English category taxonomy. I assumed that the English taxonomy has a more fine-grained structure, so Vietnamese would follow that taxonomy. As you know, Vietnamese Wikipedia has thousands of biological stub articles, so I chose this method to enrich them. For example, [2]. This bot has worked well over more than a thousand edits. To scan the RDF triples (article-belongto-category), I wrote my own program in C# using some basic APIs and named it Alphama Category.

Firstly, I have to say that deployment of my project depends on the agreement of each Wiki's local community. I have two ways to create new categories:

  1. The first solution. Create new categories that are popular, really basic, and needed, such as Category:Births by year, Category:Education by country, ... These categories can be empty. I will try to add as many parent categories to them as my bot and translators can, but of course there will still be many empty categories. So how do we define which categories are popular? We can count the number of Wikidata sitelinks of each category. I believe that each Wiki will need these categories as it develops. The point is whether the community accepts them being empty or not. If they say NO, that's OK and we move to the second solution below. If they say OK, then I use Alphama Category separately. I have done this on some Wikis, such as Thai Wikipedia and Bahasa Indonesia. Allow me to call these categories core categories.
  2. The second solution. I use my bot to scan all articles of each Wiki that have interwiki links with English. Suppose that article A in language A has an interwiki link with article B in English (a sitelink on Wikidata). I compare A and B. I get all of B's categories and try to convert them to language A (once again using Wikidata sitelinks). For the categories that can be converted, I check whether A's content already has them. If not, I add these categories to A. For English categories that cannot be converted, I use NLP patterns to determine the name in the local language as well as the parent categories (or rely on translators, but I don't expect to need many translators here because it is not so difficult).
  3. Currently, my bot leaves the old-style interwiki link [[en:________]] in place and lets other bots make the connection later. I did not pay attention to this because so many bots already do this task. If needed, I can update my bot to do this.
  4. Category taxonomy is a complex tree. I have found that each Wiki has its own structure, formed by its editors and their errors. I respect the local community taxonomy first. The point is that my bot depends on English, so for a new category A it will give A the parents that already exist in the local language. Let me give an example.
  • Category B in English has parents B1, B2, B5, B7, B110.
  • Category A in Vietnamese is new. I create it. A and B have interlinks. B1 has an interlink with A1. B110 has an interlink with A110. So I infer that category A will have two parents, A1 and A110. I try to create at least 1 parent for any new category.
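
The parent inference in the example above could be sketched roughly as follows. This is only an illustration in Python, not the actual Alphama Category program (which is written in C#). The function name `infer_local_parents` and the plain dictionary standing in for Wikidata sitelinks are assumptions made for this sketch.

```python
from typing import Dict, List, Set


def infer_local_parents(
    english_parents: List[str],
    sitelinks: Dict[str, str],
    existing_local: Set[str],
) -> List[str]:
    """Map the parents of an English category to local-language parents.

    english_parents: parents of the English category B (B1, B2, ...).
    sitelinks: hypothetical English -> local mapping, standing in for
               Wikidata sitelinks between category pages.
    existing_local: categories that already exist on the local wiki.
    """
    parents: List[str] = []
    for en_parent in english_parents:
        local = sitelinks.get(en_parent)
        # Keep only parents that have a sitelink and exist locally.
        if local is not None and local in existing_local and local not in parents:
            parents.append(local)
    return parents


# The example from the discussion: B has parents B1, B2, B5, B7, B110,
# and only B1 and B110 have local counterparts.
english_parents = ["B1", "B2", "B5", "B7", "B110"]
sitelinks = {"B1": "A1", "B110": "A110"}
existing_local = {"A1", "A110", "A200"}

inferred = infer_local_parents(english_parents, sitelinks, existing_local)
print(inferred)
```

If the returned list is empty, the new category would have no parent at all, which is the case where a translator or the NLP step mentioned above would have to step in.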

Yes, your last question is a very good one, and it makes me hesitate and wonder. Everybody can contribute to Wikipedia freely. So can I. My second solution guarantees that my bot contributes edits based on the English taxonomy, so I try to convince people that this is the common trend of Wiki development and that these articles need these categories. They are like common knowledge that everybody should know. For example, the article Mathematics must have the category Formal Sciences. Classifying articles into their proper categories is like using tags for news articles. If they disagree, they can take it out. But it makes sense that not many people on small-scale Wikis want to do this task, because it is boring and repetitive. Another point is that my project can actually work in all languages. If a language community does not accept my idea, I may move to another language and ask how they feel. I believe people will rarely disagree with this. Alphama (talk) 14:56, 22 September 2014 (UTC)

PS: Actually, the number of Vietnamese editors is limited. This way, we can improve our project. My idea came when Cheer-Bot created new categories and I had to do something to enrich that content. I have also done part of some other projects, such as detecting interlinks, enriching references, ... Typical example. Alphama (talk)


Thanks for the response, Alphama. Yes, everybody can contribute to Wikipedia freely, but that doesn't mean that everyone will contribute to Wikipedia constructively. You are probably aware that bots should be programmed and created based on community consensus, which is usually based on the potential benefits that the bot will create while causing minimum disturbance. Just to be clear, I like this project, but I think it's very important for this proposal to address the following points before it's considered for funding:

  1. Community endorsement of the bot operation on the local wiki
  2. No mass creation of empty categories
  3. Minimum number of redundant new categories
  4. Linking the new categories to Wikidata
  5. Adding parent categories to all new categories
  6. Adding corresponding articles to all new categories.

Please let me know if any of the above points are unclear, and I will be happy to explain more. Good luck! --HaithamS (WMF) (talk) 20:15, 22 September 2014 (UTC)

I agree that we need community endorsement. In my own experience, I have run into this. I opened some discussions and invited as many editors as I could. Nobody wanted to discuss or offer any ideas (small-scale wikis do not have many editors), or they were not interested, or they did not care, ... I notified them that I would do this and that ... but no one joined. Then, when my bot had contributed hundreds of edits, several editors suddenly appeared and said that they disagreed. I then discussed with them to find agreement, but you know that we could not reach the same view in some cases. Here, I want to say clearly: I will try my best to seek agreement, but we cannot always reach it. What will I do next? If I revert my edits, it means my project did not help that Wiki. I expect there will not be many such cases. If I continue, it means some editors may see my project as damaging their Wiki?

I am not trying to raise a problem, but frankly I must make clear to you what community endorsement involves. Normally, it is not so difficult, but in some cases it is quite complex. For example, on Bahasa, article A is a stub article (a few sentences) about fish; my bot adds Category:Fish stubs, but somebody reverts it because he does not want a stub category to show that his Wiki is still poor. That is why, in the last step of my project, I have to gather community opinions and make some assessments as well as error fixes based on their requirements.

So I will seek community endorsement in the first step, and in the last step I will seek it again by collecting their opinions and fixing my edits based on those opinions as much as I can. Alphama (talk) 02:54, 23 September 2014 (UTC)

Translators & Linguists

About translators & linguists, this budget item can be cut because there are already many translators & linguists on their local wikis, and they already use their knowledge to contribute to Wikipedia for free. You don't need to employ any of them, as you can simply ask them. --Octahedron80 (talk) 01:31, 27 September 2014 (UTC)

PS. With my programming proficiency, the project would be easy if one focuses on a pair of wikis like my bot does. I could even do the entire project for free, and within a few months! --Octahedron80 (talk) 01:42, 27 September 2014 (UTC)

Thank you for your comment. If you are interested in this project, you can join it or propose your own idea at Grants:IEG. This would help Wikimedia a lot in improving the movement. For this project, I am afraid that some small-scale Wikipedias will not have many translators & linguists when I need them. So I have invited many of my friends (from Thailand, Indonesia, the Philippines) who will help me with these tasks. Of course, community involvement is very important, and I expect to cooperate with as many editors as possible. However, user contributions may depend on their mood and their time. With thousands of new categories (after translation by bot) on each Wiki, who will assess whether they are correct? And for thousands of RDF triples (category-belongto-article), who will do that task? Remember that any surplus budget will be refunded to Wikimedia according to their policies. It would be nice to see some editors of Thai Wikipedia improve a similar tool, and I hope to see this Wiki develop as well. Alphama (talk) 15:55, 27 September 2014 (UTC)
What is your definition of 'small-scale'? Projects such as thwiki, mswiki, and idwiki couldn't be judged small given their number of articles, frequency of edits, and active contributors; they are likely to be medium in my opinion. I have been on Thai Wikipedia for 8 years. --Octahedron80 (talk) 08:17, 28 September 2014 (UTC)
Ah yes, Thai Wikipedia is an excellent project with thousands of editors and good programmers. I may change the title to something that fits these Wikis better. Thanks for your comment. The definition can be discussed somewhere else. Alphama (talk) 12:25, 28 September 2014 (UTC)

Experiences from nowiki

It seems like this could be a good idea, but it also has some serious drawbacks. The most obvious one is that there are differences between the category hierarchies on different wikis. What's not so obvious is that some of the differences have a cultural context. In some cases those differences are spelled out on policy pages, but more often than not such written documentation is missing. Sometimes those cultural differences manifest in slightly different names that can be difficult to sort out if you don't know the cultural context. At nowiki we accept categories on nationality, but we try to avoid categories on ethnicity. In some cases (i.e. ethnicity) we want a reason in the content of the page, and that reason should have a reference. In short, we should have a blacklist of categories that should not be added unless a human does the editing.

It is possible to fix the problem, and others too that were spotted, but it will take some time to sort them all out. That could be a good reason to put some additional effort into this, as that would give some incentive to sort out those really difficult questions which we do need some answer to. — Jeblad 16:30, 29 September 2014 (UTC)

Yes, I agree. Different wikis may have different category naming conventions. I will add the blacklist to my work. The first step is to create and classify the common categories. This step is stable and applies well to all languages. The second step is to create new categories that are very basic and needed for that Wiki, for example: Category:Years, Category:1, ... I would like to contribute to all Wikis, but small-scale Wikis, where the category structure is still primitive, somehow look more suitable for this project. However, I will also study nowiki and medium-scale Wikis to look for good solutions. Alphama (talk) 06:11, 30 September 2014 (UTC)
In this project, I first tried to cover all languages. But because of the drawbacks mentioned above, I may focus on some specific languages. Alphama (talk)
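
A blacklist of the kind discussed here might look roughly like the sketch below. It is an assumption-laden illustration in Python: the function name `split_by_blacklist`, the simple substring matching, and the example terms are all hypothetical. A real list would have to come from each community.

```python
from typing import Iterable, List, Tuple


def split_by_blacklist(
    candidates: Iterable[str],
    blocked_terms: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split candidate categories into (bot may add, needs a human).

    blocked_terms: hypothetical per-wiki list of terms. Any candidate
    whose name contains one of them is held back for human review.
    """
    blocked = [term.lower() for term in blocked_terms]
    auto: List[str] = []
    review: List[str] = []
    for name in candidates:
        lowered = name.lower()
        if any(term in lowered for term in blocked):
            review.append(name)  # a human decides, with a referenced reason
        else:
            auto.append(name)
    return auto, review


auto, review = split_by_blacklist(
    [
        "Category:Norwegian painters",
        "Category:People by ethnicity",
        "Category:1950 births",
    ],
    ["ethnicity", "ethnic"],
)
print(auto)
print(review)
```

Substring matching is deliberately crude. It only shows the idea that blacklisted categories are never added automatically and instead go to a list for editors to check.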

Question about your proposal’s scope

Thank you for your submission, Alphama. Because IEG funds 6-month experiments, we have a question about your proposal:

  • Regarding the “Running & Maintenance” line item listed on your budget, would you please provide clarification on what “running & maintenance” entails? Why is this funding required for an additional 6 months after having created and launched your tool?

Looking forward to your reply soon so we can determine whether this proposal would be eligible for IEG review in this round! -- Thanks, Jtud (WMF) (talk) 21:46, 2 October 2014 (UTC)

I will hire some machines (laptops, desktops, or maybe EC2) to boost the speed of my tools when they access Wikis through web APIs. I use my tool to sort and format the categories of given articles and to compare the relationships between these categories and the categories to be added (parents, redirects, children), to determine whether we need to add new categories or not. This process can take a long time (from several hours to a day). I will continue to maintain this project after 6 months to study how stable its results are on the local Wikis. I am afraid that some new categories may not fit the local Wiki's culture in the future, even though we have had discussions and agreements with the local Wikis, or that we cannot foresee all the cases. For example, there is the plural problem: Category:Languages and Category:Language are quite different on English Wikipedia, but on many Wikis many editors consider them the same. We may create these categories in compliance with Wiki agreements, but after 6 months the community may change its mind. Alphama (talk) 00:54, 3 October 2014 (UTC)
Thanks for this clarification, Alphama. From what I understand here, you're thinking about the longer-term maintenance and improvements to the tool as a second phase, once you've developed something that has proven successful in the first 6-month project. I think we can go ahead and consider this proposal eligible for now based on this, and come back to discuss the budget and maintenance questions in more detail if this project is recommended for funding by the community. Best wishes, Siko (WMF) (talk) 19:28, 7 October 2014 (UTC)

Eligibility confirmed, round 2 2014

This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for round 2 2014 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 2 2014 begins on 21 October 2014, and grants will be announced in December. See the schedule for more details.

Questions? Contact us.

Siko (WMF) (talk) 19:28, 7 October 2014 (UTC)

Discussion and consensus building should precede the development work

I agree with the comments made above by Haitham: the discussion and consensus-building for the proposal should precede the time (and funding) spent on the development of the software. While I understand the enthusiasm and eagerness to get on with the "actual" work, this community-organizing work is in fact actual work, and, indeed, is the more significant actual work in this case. It does not make sense to only build in the community discussion as a final part of this project.

I would encourage you to withdraw the proposal, to discuss the project on the wikis you proposed to deploy it in, and perhaps also on some broader technical venues like the MediaWiki wiki or the wikitech mailing list, and to re-submit the proposal if and when there is a clearer idea of the plan being acceptable to the wiki communities, and the assumptions you make (chief among them that the particular category structure of the English Wikipedia would definitely be [considered] a Good Thing in the eyes of other communities) are found to be validated in fact. Asaf (WMF) (talk) 03:30, 16 October 2014 (UTC)

This is my first time submitting a proposal here, and it seems to be a difficult time. I did the main part of this project on Vietnamese Wikipedia over 1 year. If I withdraw now, I need to wait at least 4-5 months to submit again (next round). By the way, I have not seen any projects come from viwiki in 10 years. It seems I need to discuss with viwiki and proceed with the project alone for now. I appreciate all the comments above. Alphama (talk) 15:08, 16 October 2014 (UTC)
If it is acceptable to the viwiki community, that's great. But to fund an expansion of your work along the lines in this proposal, we'd expect some preliminary community work to establish the technical work would be welcome, yes.
Please don't be discouraged! It is the nature of grant-making that some proposals don't quite make sense to fund as they are, and need some re-thinking or preliminary work before they can be funded. I would personally be interested to learn more about your previous technical work within viwiki. Would you like to perhaps have a teleconference (Skype/Hangout) call to discuss this? Asaf (WMF) (talk) 20:26, 16 October 2014 (UTC)
From what we discussed, I now understand that the best way to submit a proposal here is to have (must have) successful or nearly complete work on the Wikis and to have gained community support. If so, it sounds like we submit results, not a proposal. I should have known that. I created a tool that complies with Help:Category for the second phase of this project. The process is to collect English relations (article-belongto-category), format the categories of viwiki (category tags in articles), check and remove redundant parent-children relations on viwiki articles, check redirects, and add the new relations to articles.
Are you sure you really want to discuss it? If not, you don't need to do that to show your appreciation. By the way, what you said seems to be correct. In your position, I quite understand that you want to be 100% sure everything is OK and nothing will go wrong before granting any proposal. I should withdraw this project and continue it independently. Thanks for taking the time to review my proposal. Alphama (talk) 08:23, 17 October 2014 (UTC)
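
The steps described above (resolve redirects, merge the proposed categories with the existing ones, and drop redundant parent-children relations) could be sketched as follows. This is a hedged illustration in Python, not the actual tool. The names `resolve_redirect`, `ancestors`, and `merge_categories`, and the plain dictionaries used for the parent and redirect data, are assumptions made for this example.

```python
from typing import Dict, List, Set


def resolve_redirect(name: str, redirects: Dict[str, str]) -> str:
    """Follow category redirects to the final target (guarding against loops)."""
    seen: Set[str] = set()
    while name in redirects and name not in seen:
        seen.add(name)
        name = redirects[name]
    return name


def ancestors(category: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect parents of a category."""
    found: Set[str] = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def merge_categories(
    current: List[str],
    proposed: List[str],
    parents: Dict[str, List[str]],
    redirects: Dict[str, str],
) -> List[str]:
    """Merge proposed categories into an article's list, then drop any
    category that is an ancestor of another category already listed."""
    merged: List[str] = []
    for name in current + proposed:
        target = resolve_redirect(name, redirects)
        if target not in merged:
            merged.append(target)
    redundant: Set[str] = set()
    for name in merged:
        # Exclude the category itself in case the hierarchy has a cycle.
        redundant |= ancestors(name, parents) - {name}
    return [name for name in merged if name not in redundant]


# Hypothetical data: "Fishes" redirects to "Fish", which is the parent
# of "Freshwater fish".
parents = {"Freshwater fish": ["Fish"], "Fish": ["Animals"]}
redirects = {"Fishes": "Fish"}
result = merge_categories(["Fish"], ["Freshwater fish", "Fishes"], parents, redirects)
print(result)
```

Note that always removing the parent is itself an assumption. Some wikis keep an article in both a category and its subcategory on purpose, so a local rule may need to override this step.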
No, it is absolutely not the case that we expect guaranteed results from every grant. Indeed, grants are often experiments, and we at WMF certainly take into account that some grants will not meet their goals. What we are looking for is grants that, at the time of the funding decision, are at least well-positioned to succeed, in terms of a favorable environment, including community approval, and in terms of resources and competencies. That is what we have been encouraging you to demonstrate, above. I understand language may be a barrier to understanding here, and perhaps a video call would allow clearer communication. Let us know if you'd like to have such a call. Asaf (WMF) (talk) 17:20, 20 October 2014 (UTC)
I wonder if this project is one where the technical problems are such that community consensus can't be reached without a working prototype, and proper operation of that prototype can't be verified without people from the community, who have experience in that particular language, using the tool. So the tool must change from something a user outside the particular community uses in an off-line mode into something any user in the community can use in an interactive on-line mode. Some of the problems this project ran into at nowiki were language and culture specific, and those problems can't be resolved without involving people from the community. Because people must be able to work on this when they have time, not when the external user has time, the conclusion as I see it is that the tool must be interactive. Perhaps something along the lines of the GLAM toolkit?
So my idea is: give a grant for a feasibility study and perhaps do some initial prototype work together with WMF. Get input from the communities, adjust, and make a final tool. — Jeblad 12:33, 17 October 2014 (UTC)

The point is that we don't have a strict standard category taxonomy for all languages. Therefore, we suppose that the English taxonomy may be the best standard, because it has been studied by many researchers and has the highest collaborative quality. I know many scholars may disagree with this, but they only research this taxonomy and do not contribute much to it. Each language has its own category taxonomy based on the understanding of its editors. The tools I found on Wikipedia only deal with formatting categories, changing redirect categories, or removing parent-children categories.

The main question is how to classify articles into categories properly and which standard you will trust to be correct. It is impossible to gain community agreement in some cases because it mainly depends on their understanding.

I may create this tool and let the community contribute whatever they prefer. I am a little confused now, but I believe that I need to discuss with viwiki and continue to contribute to this project in the long term. I will be ready to share this tool with any wikis that really need it. Alphama (talk) 14:14, 17 October 2014 (UTC)

@User:Jeblad: I have already roamed around more than ten wikis, and what I received was a nonchalant attitude from editors. I don't think I have a chance here, but I will be back at nowiki in a few days. I hope you will still be there to help me then. Thank you! Alphama (talk) 14:17, 17 October 2014 (UTC)

  • I am also not sure that we can fund a project that will have an impact on the overall wiki-project (hierarchy of categories, their massive creation, etc.) without clear consensus of the wiki-community. I understand that it won't be fast, but I think that we can't go ahead without prior discussion with the local community: if they agree with the overall idea, we can fund it; if we don't know, it's too risky to start such a project. rubin16 (talk) 13:38, 18 October 2014 (UTC)
Is it possible to proceed if I no longer need community agreement? What if I develop this tool for research purposes only? Then I may suggest the tool to the community later. Editors will contribute depending on what they want. Alphama (talk) 11:28, 19 October 2014 (UTC)
In that case you should state other problems to be solved, other solutions proposed, etc., as you won't need to create categories and the project will carry out some other work, if I understand correctly. rubin16 (talk) 16:27, 19 October 2014 (UTC)

NLP experience

I love everything NLP, but the proposal lacks information on the proposer's NLP experience (in the field/academic/professional). --Nemo 09:14, 17 October 2014 (UTC)

Can you please point this out and give me more suggestions? I now do this independently, but your opinions are still helpful. I only use some NLP patterns in this project, not NLP in general. Alphama (talk) 09:21, 17 October 2014 (UTC)
Err, I'm not the one asking for a wage for NLP work; it's your job to show you have expertise to give in return for the wage. --Nemo 08:33, 29 October 2014 (UTC)

Aggregated feedback from the committee for Semi-automatically generate Categories for some small-scale & medium-scale Wikis

Scoring criteria (see the rubric for background), scored from 1 = weak alignment to 10 = strong alignment:

(A) Impact potential: 6.2
  • Does it fit with Wikimedia's strategic priorities?
  • Does it have potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning: 5.3
  • Does it take an innovative approach to solving a key problem?
  • Is the potential impact greater than the risks?
  • Can we measure success?
(C) Ability to execute: 5.3
  • Can the scope be accomplished in 6 months?
  • How realistic/efficient is the budget?
  • Do the participants have the necessary skills/experience?
(D) Community engagement: 2.8
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
  • Does it support diversity?

Comments from the committee:
  • Generally like the idea of helping build structure semi-automatically for small language wikis.
  • Worth considering that categories may be sooner or later substituted by structures provided by Wikidata.
  • Unclear fit with strategic priorities. Could be regarded as fitting with "improving quality", but not sure if this is the proposer's actual aim.
  • Ideally executed, it could have impact across projects. Would possibly be sustained or adapted, if it proves useful.
  • Due to variation in the structure of categories between different wikis, it may not be easy to scale across a number of wikis. Each community feels very differently about categories. Could be worthwhile to map how languages differ and why.
  • Approach seems innovative, but risks are high. If executed and implemented carelessly, this project could cause a lot of distress and trouble on small wikis. Proposal is lacking community discussion. Would have liked to see at least one community expressly endorsing the project.
  • It may not be feasible to mass-produce categories which are not needed. This could result in unnecessary extra work for users to tidy up these categories. Would not want to see small wikipedias with only a small number of articles flooded by empty categories if this is not explicitly demanded by the active communities there.
  • The level of experience for this participant to successfully complete this project is unclear. Proposer seems to have skills for setting up the tools as proposed, but no evidence of proposer's ability to explain these ideas to the local communities in the target Wikipedias and to reach consensus there. Does not appear to have enough effort and time scheduled to engage with the communities. Would want to see a larger chunk of the proposed time and effort dedicated to community discussions.
  • In small and medium wikis the articles are few and the categorization managed more easily, so costs/benefits of this project may not be proportional.
  • Success could be measured easily.
  • Proposal/budget is covering more than six months. We would have liked to see more concrete plans.

Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. — ΛΧΣ21 17:07, 13 November 2014 (UTC)

Round 2 2014 Decision

This project has not been selected for an Individual Engagement Grant at this time.

We love that you took the chance to creatively improve the Wikimedia movement. The committee has reviewed this proposal and not recommended it for funding, but we hope you'll continue to engage in the program. Please drop by the IdeaLab to share and refine future ideas!

Comments regarding this decision:
Categories across multiple projects are very complex and this proposal needed a higher degree of community endorsement to justify the risk of further complicating already complicated processes. A future proposal with more concrete 6-month plans and a strong investment in community consultation would be better received.

Next steps:

  1. Review the feedback provided on your proposal and ask for any clarifications you need using this talk page.
  2. Visit the IdeaLab to continue developing this idea and share any new ideas you may have.
  3. To reapply with this project in the future, please make updates based on the feedback provided in this round before resubmitting it for review in a new round.
  4. Check the schedule for the next open call to submit proposals - we look forward to helping you apply for a grant in a future round.
Questions? Contact us.