Grants talk:Programs/Wikimedia Community Fund/Rapid Fund/Comprehensive anti-spam external link service (the Citron project) (ID: 22754010)
Follow-up questions on your grant application from ESEAP Funds Committee and Programme Officer
Hello @Plantaest,
Thank you for putting in a grant application.
And also for successfully submitting your final report for the project on Comprehensive link checking tool (ID: 22451570). Congratulations once again.
Please find the consolidated questions/comments about your application:
1) I would be more comforted if the application further expanded on how this differs from the existing blacklist tools we have, whether it builds on any tools, how it interacts with filters, etc. It would also be good to get input from other technical editors, as opposed to just community editors.
2) [a] How do you plan to introduce this project to Vietnamese Wikipedians? Are there any regular meetings for the Vietnamese Wikipedia community? [b] Could you consider adding more people to this project? E.g: Communication (engage with community, do a survey, etc)
3) At the end of this project, how much human verification will still be needed to keep the Citron project operational?
4) a) I counted the budget. It will be 90 days nonstop, 5 hours a day. It seems like part-time work at 55 USD/day. No holiday at all for 3 months? b) How does the machine know whether a link is "good" or "bad"? Could you please share more details? c) The applicant wrote that the admin will add the "bad external" links to the blacklist manually. What does he mean? Is it a heavy lift for the admin, and would the admin do that voluntarily? Or does the applicant plan to do this during the project? d) Ten volunteers will be involved in giving reviews and feedback. Do you plan to compensate them for their time? e) Suggestion: present this project at the ESEAP monthly community meeting to get feedback too.
5) Can this be used for other language wikis?
6) About the proposed budget, we noticed that there is an increase in per hour rate from the previous grant. Could you provide more details on the per hour rate estimates?
7) From this project application, it is challenging to understand the potential impact of your project and how useful it will be for the community. Could you help us understand and envision this part better?
8) From this project application, it is also not clear how this project differs from the previous application or, perhaps, builds on your previous work. Could you help us understand this part better as well?
9) Is the code to be public and under an OSI approved free software license?
Thank you in advance for taking time to respond.
We'll also be sharing your application with the Product and Technology team for further review and will be posting additional questions and feedback from that process when we have the details next week.
We noted that the proposed start date of the project is 1 October 2024 and hope to provide another status update before then.
Regards,
Jacqueline on behalf of the ESEAP Funds Committee JChen (WMF) (talk) 06:41, 19 September 2024 (UTC)
Additional questions from Product and Technology team
Hello @Plantaest,
Thank you once again for your application. I am sharing additional questions for your response. Thanks for taking time to support us in understanding your project better.
10) I'm also curious what "some supplementary data" implies exactly. Where does it come from, how is it created, is it open?
11) "To evaluate the effectiveness of the machine learning model in checking and classifying links, I plan to use basic metrics such as accuracy, precision, recall, and F1 score." We gather this is a common research approach and wonder if you could also speak to the follow-up work that might be needed, if any?
12) As this is a project that will be led by you, we recommend that your work be well documented so that someone else (if needed, and if there is interest) could take over the codebase in the future.
13) We observe that the timeline in the application is a bit high level. For future applications, it would be helpful to provide more details.
14) For consistency of terminology, we suggest adopting "allowlist"/"blocklist" instead of "whitelist"/"blacklist"
Thank you.
Regards,
Jacqueline on behalf of the team providing additional review JChen (WMF) (talk) 02:59, 23 September 2024 (UTC)
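For readers unfamiliar with the metrics named in question 11, the following minimal sketch shows how accuracy, precision, recall, and F1 score are computed for a two-class link classifier. The label convention (1 for a "bad" link, 0 for a "good" link), the function name `evaluate`, and the `Metrics` record are assumptions made for illustration; they are not taken from the application or from Citron's code.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true: list[int], y_pred: list[int]) -> Metrics:
    """Compute metrics for binary labels (assumed: 1 = bad link, 0 = good link)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bad links flagged as bad
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good links flagged as bad
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bad links that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good links left alone

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Metrics(accuracy, precision, recall, f1)
```

Precision answers "of the links flagged as bad, how many really were bad?", while recall answers "of the truly bad links, how many were flagged?". F1 is the harmonic mean of the two, which is why it is often preferred over accuracy when bad links are rare compared with good ones.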
- @JChen (WMF): Hello Jacqueline Chen, it's a pleasure to meet you again today. Please accept my apologies for the delay in responding, and thank you for accepting the report of the previous project. I will respond to your questions in the most understandable way, as you suggested in your email.
1) I would be more comforted if the application further expanded on how this differs from the existing blacklist tools we have, whether it builds on any tools, how it interacts with filters, etc. It would also be good to get input from other technical editors, as opposed to just community editors.
Traditionally, to manage cases where vandals add bad links, we usually use techniques such as the blacklist (like mw:Extension:SpamBlacklist) or filters (like mw:Extension:AbuseFilter), as you mentioned. These are tools already integrated into Wikimedia's wikis. These tools are considered passive mechanisms. We have to manually find bad links somewhere and then add them to the blacklist/filter so that no one can add these links to articles anymore. The tools themselves won't help us find these links. Searching for these bad links is very labor-intensive because editors have to check each article revision to see if any bad links have been added. Editors often review edits through the RecentChanges interface (mw:Help:Recent changes). As stated in the proposal, the behavior of inserting bad links has become more sophisticated in the Vietnamese Wikipedia case, and vandals have found ways to evade editors, allowing bad links to persist for a long time, violating Wikipedia’s "no advertising" rule.
The Citron tool is an active mechanism, which is the most significant difference. It continuously scans the RecentChanges feed automatically to check the latest edits on articles, extracts links, evaluates whether they are good or bad, and generates reports. Thus, humans don't need to manually search for links but only need to review the reports generated by Citron.
This is the main advantage of the tool.
Citron is my own creation and does not build on any existing tools, although it does use some data that the community has accumulated over the years to create a machine learning model for evaluating links.
Citron does not seem to affect filters. Its scope of operation is limited to scanning RecentChanges and generating reports, while adding links to filters is at the discretion of the administrators.
Regarding input from technical editors, in a discussion with the Vietnamese Wikipedia community: vi:Wikipedia:Thảo luận/Dự án Thanh yên, some community members interested in technical matters, such as Hide on Rosé (templateeditor), NgocAnMaster (templateeditor), Pminh141 (templateeditor), and Trần Nguyễn Minh Huy (interface-admin), all agreed with my draft: vi:Thành viên:Plantaest/Citron, even though I invited feedback on any aspect of the software. I believe that the draft I wrote was clear enough for those participating in the discussion to understand what I intended to do.
2) [a] How do you plan to introduce this project to Vietnamese Wikipedians? Are there any regular meetings for the Vietnamese Wikipedia community? [b] Could you consider adding more people to this project? E.g: Communication (engage with community, do a survey, etc)
[a] According to the proposal, I plan to introduce this project to the Vietnamese Wikipedia community in the third month of the plan, after building the core parts of the project. The location will be vi:Wikipedia:Thảo luận, which is a "forum" where editors can discuss any issues that require feedback from many people.
The discussion will take place continuously over a month, so there will not be any "regular meetings". Most Vietnamese Wikipedia members are volunteers who don't have much time to stay online constantly, making it impossible to gather everyone at the same time.
Therefore, discussions usually last quite a long time and are concluded at a certain point.
[b] I believe not many people are needed for this project. This is not a large-scale software project, which aligns with Rapid Fund guidelines, so I can handle all the work.
3) At the end of this project, how much human verification will still be needed to keep the Citron project operational?
Human verification can be optional, not mandatory. The core mechanism of Citron is using a machine learning model, which can operate automatically without human intervention. However, as machine learning models are not 100% accurate (of course), human review can play a role if the wiki has enough manpower. This depends on the scale and culture of each wiki community. If there is no human involvement, we will rely on reports solely from the machine learning model.
4) a) I counted the budget. It will be 90 days nonstop, 5 hours a day. It seems like part-time work at 55 USD/day. No holiday at all for 3 months? b) How does the machine know whether a link is "good" or "bad"? Could you please share more details? c) The applicant wrote that the admin will add the "bad external" links to the blacklist manually. What does he mean? Is it a heavy lift for the admin, and would the admin do that voluntarily? Or does the applicant plan to do this during the project? d) Ten volunteers will be involved in giving reviews and feedback. Do you plan to compensate them for their time? e) Suggestion: present this project at the ESEAP monthly community meeting to get feedback too.
This is an important question, and I will answer it to the best of my ability.
(a) This is similar to my previous project. I work every day, which is normal for me. In a typical working session, I spend a lot of time thinking. I don't take any holidays except for the Lunar New Year.
(b) This is a complex issue to explain, so I'll make it as simple as possible.
If you use ChatGPT and ask, "Is 'meow meow' the sound of a cat or a dog?", how can ChatGPT answer that? You likely know that people have already provided ChatGPT with the necessary data to answer your question. So, the answer here is "data". What kind of data can be used to evaluate whether a link is "good" or "bad"? I have described this in the draft: vi:Thành viên:Plantaest/Citron#Đặc tả sơ bộ. It could involve the presence of advertising terms in the domain name, the domain's age, or the site's ranking. These data will be collected, labeled, and used to create a machine-learning model.
(c) Why must administrators decide whether to add "bad" links to the blacklist? Because, technically, only administrators can edit the MediaWiki blacklist, and my software will not interfere with that to ensure security. Administrator rights are very sensitive. Currently, Vietnamese Wikipedia administrators already add bad links through community reports: vi:Wikipedia:Tin nhắn cho bảo quản viên (sections titled "Báo cáo liên kết rác"), and they add them here: vi:Special:BlockedExternalDomains. I don't see any controversy here, as Citron's goal is to assist in detecting and generating reports, not directly interacting with the blacklist. Just doing this is already a great help to the community, as searching for bad links is a labor-intensive task, while adding them to the blacklist is a very simple action that any administrator knows how to do.
(d) They voluntarily participate without compensation, as in my previous project.
(e) I do not plan to reveal my identity due to political concerns in my country. Therefore, I decline to participate in any ESEAP monthly community meetings. If you wish to provide feedback for this project, you can message me at User talk:Plantaest.
5) Can this be used for other language wikis?
Yes, as mentioned in my proposal. When deploying for any wiki, we will create a new instance of Citron through a management interface.
Once the corresponding Citron instance is in place, that wiki will receive Citron reports. I will refine the details of this process during the official deployment.
6) About the proposed budget, we noticed that there is an increase in per hour rate from the previous grant. Could you provide more details on the per hour rate estimates?
Yes, I recognize that this is a more complex project compared to Feverfew (the previous project), as I had to write a clearer draft this time compared to my Feverfew proposal: vi:Thành viên:Plantaest/Citron. This requires more effort in research and data collection, so I would like to request a better rate for this effort. I believe this is still a relatively low rate compared to other software projects such as: Grants:Programs/Wikimedia Community Fund/Rapid Fund/Wikifile-transfer tool’s bug fixes and feature development (ID: 22458511), Grants:Programs/Wikimedia Community Fund/Rapid Fund/Expanding Twinkle Lite functionalities to admins, improving its maintainability and allowing user customisation (ID: 22666838). This rate is lower than some of my recent freelance projects.
7) From this project application, it is challenging to understand the potential impact of your project and how useful it will be for the community. Could you help us understand and envision this part better?
The answer is time. Imagine, before translation tools like Google Translate or ChatGPT existed, when we wanted to translate a document, we had to learn a new language and use dictionaries, right? Nowadays, with Google Translate or ChatGPT, you can use them to assist with translation and just review the result to ensure there are no significant errors. These tools save you time.
Citron is similar; it helps editors save time in detecting unusual links in articles. Manually checking for such links is really labor-intensive, and since we are all volunteers, we cannot participate continuously.
If we only continue checking manually as before, missing some malicious links is inevitable, especially as vandals are becoming more and more sophisticated, as I mentioned in the proposal.
8) From this project application, it is also not clear how this project differs from the previous application or, perhaps, builds on your previous work. Could you help us understand this part better as well?
These are two completely different projects, although they may use some of the same technologies, such as Quarkus and React, for building. These technologies are merely tools; you can use them to build either a mansion or a townhouse depending on your needs.
Feverfew (the name of the previous project) is a tool that helps check whether the links in an article are “alive” or “dead.” “Alive” means they are accessible, displayable, and contain content, while “dead” means they are inaccessible and return a 404 error. This check helps editors ensure that their references are functioning properly, because dead links can undermine the credibility of the article’s content. For more information, see: en:Wikipedia:Link rot.
Citron (the name of this project) is a tool that monitors the RecentChanges feed in real time and generates automated evaluation reports. These reports show which bad links have been added to articles by users over a given period, such as one day. “Bad” here refers to links that violate Wikipedia’s guidelines, usually advertisements for companies, stores, gambling sites, trading platforms, or personal blogs. Administrators will review the report and add these bad links to a blacklist to prevent malicious users from continuing to insert them into articles in the future. For more information, see: en:Wikipedia:Spam#External link spamming.
These two projects address completely different issues and are not related to each other.
Each one deals with different Wikipedia policies.
9) Is the code to be public and under an OSI approved free software license?
The source code will be publicly available on GitHub and released under the GNU Affero General Public License version 3, which is an OSI-approved license.
10) I'm also curious what "some supplementary data" implies exactly. Where does it come from, how is it created, is it open?
In the context of that statement, I mean additional assessments from humans. For example, some links may fall somewhere between “bad” and “good” according to the machine learning model’s evaluation, so editors can re-evaluate them to provide one or more opinions to the administrators on whether these links should be added to the blacklist.
11) "To evaluate the effectiveness of the machine learning model in checking and classifying links, I plan to use basic metrics such as accuracy, precision, recall, and F1 score." We gather this is a common research approach and wonder if you could also speak to the follow-up work that might be needed, if any?
When creating machine learning models, I will experiment with various classification algorithms to determine which one is the “best”, meaning it has the fewest errors and is reliable. To assess this, I need to use the aforementioned metrics. I typically choose the model with the highest F1 score. Once I have the best model, I will integrate it into Citron.
12) As this is a project that will be led by you, we recommend that your work be well documented so that someone else (if needed, and if there is interest) could take over the codebase in the future.
I agree. I implemented this in my previous project, Feverfew:
- Introduction page: en:User:Plantaest/Feverfew
- Source code: https://github.com/plantaest/feverfew
- Progress log: en:User:Plantaest/Feverfew/Progress
- Notes: en:User:Plantaest/Feverfew/Notes
- Translation: translatewiki:Translating:Feverfew
The Citron project will be similar.
13) We observe that the timeline in the application is a bit high level. For future applications, it would be helpful to provide more details.
I agree. I will try to write a more detailed plan for future projects. However, I want to mention that software development is very different from organizing a competition or a workshop like other grants; it is more of a creative process than a strictly scheduled one. This means it depends on research, exploration, and brainstorming, which often leads to uncertainty regarding completion timelines. Sometimes, the chosen solution may not be ideal, and I may have to find a different direction. I can only estimate based on my experience.
14) For consistency of terminology, we suggest adopting "allowlist"/"blocklist" instead of "whitelist"/"blacklist"
I understand that this change aims to reduce discrimination. However, in the context of MediaWiki software, these terms are familiar and seem technically difficult to change at present, at least within the Vietnamese Wikipedia community, as seen in vi:MediaWiki:Spam-blacklist, vi:MediaWiki:Spam-whitelist. I empathize with the words that may impact certain populations and hope that MediaWiki software can adapt if possible. Generally, I will continue to use "whitelist"/"blacklist" in the project documentation to ensure familiarity. I hope this is not a major issue.
Thank you for reading my answers. Sincerely. Plantaest (talk) 20:24, 23 September 2024 (UTC)
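As a rough illustration of the "extract links, then evaluate them" step described in the answers to questions 1 and 4(b), the sketch below pulls external hostnames out of a piece of added wikitext and derives a few simple features from the domain name. Everything here is a hypothetical example rather than Citron's actual implementation: the function names, the URL pattern, and especially the `AD_TERMS` list are assumptions. The domain-age and site-ranking signals mentioned in the reply would need external data sources and are not shown.

```python
import re
from urllib.parse import urlsplit

# Hypothetical advertising terms, chosen only for illustration.
AD_TERMS = ("casino", "bet", "shop", "loan")

# Matches http(s) URLs up to whitespace or common wikitext delimiters.
URL_PATTERN = re.compile(r"https?://[^\s\[\]<>|{}\"]+")


def extract_hosts(added_wikitext: str) -> list[str]:
    """Return distinct lowercase hostnames, in order of first appearance."""
    hosts: list[str] = []
    for match in URL_PATTERN.finditer(added_wikitext):
        url = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def domain_features(host: str) -> dict[str, object]:
    """Build a small feature record from the hostname alone."""
    return {
        "host": host,
        # How many of the assumed advertising terms occur in the hostname.
        "ad_term_count": sum(term in host for term in AD_TERMS),
        # Number of dot-separated labels, e.g. 3 for "vi.wikipedia.org".
        "label_count": host.count(".") + 1,
        "has_digits": any(ch.isdigit() for ch in host),
    }
```

A record like this could be one row of the labeled training data mentioned in answer 4(b); a classifier would then be trained on many such rows, and, as stated in the reply, administrators would still make the final decision about the blacklist.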
Your grant application has been approved
Hello @Plantaest,
Thank you for taking time to respond to the questions and comments. Your responses help us understand your thought process better and also hold space for all of us to partake in your project in ways that we can.
Congratulations, your grant application has been approved in the amount of USD 4,950 from 1 October 2024 to 31 December 2024. We noted that you have requested the grant in USD. As a guideline from WMF Finance, we are encouraged to award applications in the local currency. Do consider this for future applications and be in touch with your programme officer if you have additional questions or concerns.
We know even the best thought through plans may change
In the event that there are changes to your implementation schedule, you can reach out to request a grant extension (i.e., extend the end date). Similarly, if there is a surplus budget or changes to your planned budget, you can reach out to me about reallocation (via email and on this talk page). There is also the option to have the unspent funds deducted against a future grant. More details here: https://meta.wikimedia.org/wiki/Grants:Return_unused_funds_to_WMF
We thank you once again for your participation in the grant application process and hope to continue to journey with you as you embark on this project. Good luck with your project implementation.
Thank you.
Regards, Jacqueline on behalf of the ESEAP Funds Committee JChen (WMF) (talk) 03:01, 24 September 2024 (UTC)
- @JChen (WMF): Thank you, Jacqueline. I will take the currency issue into consideration in future proposals, as WMF only sends USD anyway. Regarding the remaining matters, I currently have no particular opinions. Thank you for the useful information. Best regards. Plantaest (talk) 08:26, 24 September 2024 (UTC)