Grants talk:Programs/Wikimedia Community Fund/Rapid Fund/Language models for generating Wikidata descriptions in Southeast Asian languages (ID: 22459507)

Follow-up questions, comments and observations from ESEAP Funds Committee and Programme Officer

Thank you for putting in a grant application. These are consolidated inputs from the review of your grant application. The following points are for your further response:

Background

1) Could you provide more details on your background, technical skills and past contributions, as that would help provide the assurance that you have the experience and skill sets to implement the proposed project? We have access to this information - Wikimedia Username: Alphama; applicant’s global account; global user contributions

2) We would like your confirmation that this is an individual grant (i.e. separate from the Vietnamese Wikimedians User Group). This is possible, and here is an example of past precedent: Amanda Lawrence, President of Wikimedia Australia, applied to the Research Fund separately from the WMAU grant: https://meta.wikimedia.org/wiki/Grants:Programs/Wikimedia_Research_Fund/Reliable_sources_and_public_policy_issues:_understanding_multisector_organisations_as_sources_on_Wikipedia_and_Wikidata

Community engagement

3) Do you have a preliminary idea of who the 12 voluntary participants (in four languages) will be, and how do you plan to select/invite them to participate in the feedback/review process?

4) There is an observation that while you have connected with some people about your project idea, there is no evidence that a consensus or discussion was made. Do you have a response to this?

Documentation

5) How do you plan to document the work that you are doing so that more people can learn about the process or continue from what you have started? Will the information be available and easily accessible on a metapage for example?

6) We also recommend putting up a Diff post to share your experiences post project.

Budget

7) Could you review your proposed budget, in particular the expense category, to make sure that you have accurately matched the budget line items with expense categories? For example, row 5 (data collection by web crawler/ hours) and row 9 (recommended tool building/ hours) read like staff-related expenses.

8) Could you share the names of all the tools (perhaps with the pricing links)? Eg: OpenAI (ChatGPT): https://openai.com/api/pricing. You may wish to add this to your proposed budget in the column expense details.

Project Management

9) Could you share the timeline for the project, including the estimated time needed for human evaluation of 6000 entries?

Thank you.

Regards, Jacqueline JChen (WMF) (talk) 02:21, 16 May 2024 (UTC)

My response

Hello Jacqueline,

Thank you for your questions about the proposal. Here are my answers.

1) Could you provide more details on your background, technical skills and past contributions, as that would help provide the assurance that you have the experience and skill sets to implement the proposed project? We have access to this information - Wikimedia Username: Alphama; applicant’s global account; global user contributions

Since 2012, I have contributed over 129,000 edits to Wikimedia projects, focusing primarily on the Vietnamese Wikipedia, where I previously served as a sysop and interface admin. I founded the Bot project and created tools such as the Alphama Editor, which corrects errors and adds minor content in many thousands of articles. This tool is now managed by AnsterBot. I also have experience with a technical grant from my first project, "Semi-automatically generate Categories for Vietnamese Wikipedia". Additionally, I am an NLP researcher with a Ph.D. in computer science, which qualifies me to complete this project successfully.

2) We would like your confirmation that this is an individual grant (i.e. separate from the Vietnamese Wikimedians User Group). This is possible, and here is an example of past precedent: Amanda Lawrence, President of Wikimedia Australia, applied to the Research Fund separately from the WMAU grant: https://meta.wikimedia.org/wiki/Grants:Programs/Wikimedia_Research_Fund/Reliable_sources_and_public_policy_issues:_understanding_multisector_organisations_as_sources_on_Wikipedia_and_Wikidata

Yes, I confirm this is an individual grant, and it is not related to the Vietnam Wikimedians User Group or any other user group on Meta.

3) Do you have a preliminary idea of who the 12 voluntary participants (in four languages) will be, and how do you plan to select/invite them to participate in the feedback/review process?

I have contacted the presidents of the Malaysia, Philippines, and Indonesia user groups (Taufik Rosman, Anthony Diaz, and Rachmat Wahidi, respectively). They provided me with a list of participants, which includes myself. The Malaysian members are Tofeiku, Ultron90, and PeaceSeekers. The Vietnamese members include Alphama, Song Ngư, NhacNy2412, and Bluetpp, among others. Participants from the Indonesian and Philippine communities will be recruited through a call for participants, and I have received email responses from both groups confirming this. Additionally, I have consulted with Sakti Pramudya regarding the Indonesian communities. I also plan to post a notice about my proposal to Wikimedia communities to gain more traction and attract new members.

4) There is an observation that while you have connected with some people about your project idea, there is no evidence that a consensus or discussion was made. Do you have a response to this?

My idea is to connect User Groups to work on a joint project in the ESEAP region. This kind of proposal is new because it needs members from diverse communities; therefore, there is no clear consensus yet. Members may also hesitate, for reasons of psychological safety, to engage in a multi-group project of a kind they have never participated in. However, I have contacted many stakeholders, and their reactions by email have been positive. Therefore, I believe the success of this project will encourage new forms of collaboration, not only in the ESEAP region but also at the international level.

5) How do you plan to document the work that you are doing so that more people can learn about the process or continue from what you have started? Will the information be available and easily accessible on a metapage for example?

As with my previous grants, I will create a project space on Meta to oversee the project. This space will allow participants and stakeholders to access documents, learn how to participate in the project, and communicate with each other if they encounter any issues.

6) We also recommend putting up a Diff post to share your experiences post project.

Yes, this is a great idea. I will create a Diff post to share our work and experiences on the project. This post will serve as evidence that we can collaborate effectively, regardless of language and nationality.

7) Could you review your proposed budget, in particular the expense category, to make sure that you have accurately matched the budget line items with expense categories? For example, row 5 (data collection by web crawler/ hours) and row 9 (recommended tool building/ hours) read like staff-related expenses.

I'm unsure how to categorize the items in row 5 and row 9. I assumed they were tool expenses, because they use tools to collect data online, which requires some computers to run continuously (24/7). If they are related to staff expenses, or if they cover both tools and staff expenses, please let me know so I can make the necessary adjustments.

8) Could you share the names of all the tools (perhaps with the pricing links)? Eg: OpenAI (ChatGPT): https://openai.com/api/pricing. You may wish to add this to your proposed budget in the column expense details.

For ChatGPT, we use "gpt-3.5-turbo-instruct" (https://openai.com/api/pricing/), which costs $1.50 per 1M input tokens and $2.00 per 1M output tokens. The cost is easiest to explain with an example. In the example below, we need 23 input tokens (prompt_tokens) and 195 output tokens (completion_tokens) to produce 10 candidate descriptions. We request 10 candidates so that we can take a majority vote over the results and obtain the best translation quality.

{'id': 'cmpl-9SLdQKlzL8uuTgnlnodqgyn2DKvSV', 'object': 'text_completion', 'created': 1716543376, 'model': 'gpt-3.5-turbo-instruct', 'choices': [{'text': '\n\n2001 chương trình truyền hình Hàn Quốc', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 1, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 2, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 3, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 4, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 5, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 6, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 chương trình truyền hình Hàn Quốc.', 'index': 7, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 Chương trình truyền hình Hàn Quốc.', 'index': 8, 'logprobs': None, 'finish_reason': 'stop'}, {'text': '\n\n2001 Chương trình truyền hình Hàn Quốc.', 'index': 9, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 23, 'completion_tokens': 195, 'total_tokens': 218}}

prompt: Translate this text from English to Vietnamese: "2001 South Korean television drama". Show only the result without quotes.

answer: ['2001 chương trình truyền hình Hàn Quốc', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 chương trình truyền hình Hàn Quốc.', '2001 Chương trình truyền hình Hàn Quốc.', '2001 Chương trình truyền hình Hàn Quốc.']
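The majority vote over these ten candidates could be sketched as follows. This is only an illustration under assumptions: the function name majority_vote is hypothetical, and the normalization (stripping surrounding whitespace and trailing periods before counting) is a choice made for this sketch, not a rule stated in the proposal.

```python
from collections import Counter


def majority_vote(candidates):
    """Return (winner, votes) for the most frequent candidate.

    Assumption for this sketch: candidates that differ only by
    surrounding whitespace or trailing periods count as the same answer.
    """
    normalized = [c.strip().rstrip(".") for c in candidates]
    # most_common(1) returns the candidate with the highest count;
    # ties are resolved in order of first appearance.
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes


# The ten completions from the example response above.
answers = (
    ["2001 chương trình truyền hình Hàn Quốc"]
    + ["2001 chương trình truyền hình Hàn Quốc."] * 7
    + ["2001 Chương trình truyền hình Hàn Quốc."] * 2
)

best, votes = majority_vote(answers)
print(best, votes)
```

With this normalization, the lowercase variant collects 8 of the 10 votes and the capitalized variant 2. A stricter or looser normalization (for example, ignoring case) would change the counts but not the winner here.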

Note that, according to the Wikidata guideline, a description can be up to 12 words long, whereas the example above has only 5 words. To keep the estimate simple, we assume 25 input tokens and 200 output tokens per description.

In our project, we need to collect a minimum of 50000 descriptions for each of 4 languages, not including English. So, each language costs 1.875 USD for input tokens (50000 * 25 = 1250000 tokens ~ 1.875 USD) and 20 USD for output tokens (50000 * 200 = 10000000 tokens ~ 20 USD), for a total of 21.875 USD per language. However, in practice, I expect to run the data collection about 4 times per language, because mistakes in the input prompts or in the collection code can lead to wrong results. This happens very often to anyone who works with the ChatGPT APIs. Therefore, I budget 21.875 * 4 = 87.5 USD for each language.
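The estimate above can be reproduced with a short calculation. This is a minimal sketch: the prices, token counts, and rerun factor are the ones quoted in this answer, while the function and constant names are hypothetical.

```python
# Prices quoted above for gpt-3.5-turbo-instruct (USD per 1M tokens).
INPUT_PRICE_PER_M = 1.50
OUTPUT_PRICE_PER_M = 2.00


def cost_per_language(n_descriptions, input_tokens=25, output_tokens=200, runs=1):
    """Estimated API cost in USD for one language.

    input_tokens and output_tokens are per description; runs is the
    number of times the whole collection is expected to be repeated.
    """
    input_cost = n_descriptions * input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = n_descriptions * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return runs * (input_cost + output_cost)


single_run = cost_per_language(50_000)           # one clean pass
with_reruns = cost_per_language(50_000, runs=4)  # four passes, as budgeted
print(single_run, with_reruns)
```

The two values correspond to the 21.875 USD and 87.5 USD figures given above.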

9) Could you share the timeline for the project, including the estimated time needed for human evaluation of 6000 entries?

This is my proposed timeline, from 1 June to 1 September 2024:

From 01/06 to 15/06: Crawl Wikidata descriptions in 4 languages from Wikidata, Wikipedia, and ChatGPT.

From 16/06 to 01/07: Before the language models are trained, 12 annotators will evaluate the quality of the gathered descriptions (3000 descriptions, 250 per annotator) and correct them where possible.

From 02/07 to 15/07: After the language models are trained, 12 annotators will evaluate the quality of the generated descriptions (3000 descriptions, 250 per annotator) and correct them where possible.

From 16/06 to 01/08: Train language models, including incorporating the annotators' corrections to improve translation quality.

From 02/08 to 21/08: Build a tool that produces descriptions from the language models and lets users export them to *.json files, which can then be imported into Wikidata.

From 22/08 to 01/09: Write documents, gather comments, and finish the project.
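As a rough check of the human evaluation workload in the timeline above, the two annotation rounds can be tallied as follows. This is a sketch under assumptions: the dates are taken as 2024, both end dates are counted inclusively, and all names in the code are hypothetical.

```python
from datetime import date

ANNOTATORS = 12
PER_ANNOTATOR = 250  # descriptions per annotator per round

# Annotation rounds from the timeline above (assumed year: 2024).
ROUNDS = {
    "before training": (date(2024, 6, 16), date(2024, 7, 1)),
    "after training": (date(2024, 7, 2), date(2024, 7, 15)),
}


def round_workload(start, end):
    """Return (total descriptions, days, descriptions per annotator per day)."""
    days = (end - start).days + 1  # assumption: both end dates are included
    total = ANNOTATORS * PER_ANNOTATOR
    return total, days, PER_ANNOTATOR / days


workloads = {name: round_workload(*span) for name, span in ROUNDS.items()}
total_entries = sum(total for total, _, _ in workloads.values())

for name, (total, days, per_day) in workloads.items():
    print(f"{name}: {total} descriptions over {days} days, "
          f"about {per_day:.1f} per annotator per day")
print("total entries evaluated:", total_entries)
```

Under these assumptions, the two rounds together cover the 6000 entries mentioned in the question, at fewer than 20 descriptions per annotator per day.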

@JChen (WMF): Let me know if you need anything else. Thank you and have a nice day. A l p h a m a ^Talk 09:55, 24 May 2024 (UTC)

Your grant application has been approved

Hello @Alphama

Thank you for your detailed responses to the follow-up questions.

Congratulations! Your grant application has been approved in the amount of USD 4945.50 from 1 June 2024 to 1 September 2024.

Let’s continue having regular conversations over the course of your grant implementation. Please let me know if you require support in any way or would like to share your experiences with a wider community through the Let's Connect Programme or at ESEAP community meetings.

Note on grant funding amount and changes in plans

We know even the best thought through plans may change.

In the event that there are changes to your implementation schedule, you can reach out to me to request a grant extension (i.e. extend the end date). Similarly, if there is a surplus budget or changes to your planned budget, you can reach out to me about reallocation (via email and on this talk page). There is also the option to have the unspent funds deducted against a future grant. More details here: https://meta.wikimedia.org/wiki/Grants:Return_unused_funds_to_WMF

Additional resources which may be useful

The reporting requirements and templates for the grant can be found here. All reports are to be completed and submitted via Fluxx.
Timelines for reporting can be found in your grant agreement or on Fluxx.
Instructions for post award: https://meta.wikimedia.org/wiki/Grants:Processing/Grantee_Portal/Post_award
Documenting project expenses: https://meta.wikimedia.org/wiki/Grants:Documenting_project_expenses

Recommendations

For this grant application, we would like to flag the questions/responses for (5) and (6) for inclusion in your implementation plan. It would also be useful to highlight in your final report, where possible, that these considerations have been taken into account.

We thank you for your participation in the grant application process and hope to continue to journey with you as you embark on this project. Good luck!

Regards, Jacqueline on behalf of the ESEAP Funds Committee JChen (WMF) (talk) 03:13, 27 May 2024 (UTC)

@JChen (WMF) Thank you and I will follow your suggestions on this project. A l p h a m a ^Talk 03:26, 28 May 2024 (UTC)

Licensing

Will the code, datasets and models produced here be available publicly? If so, under what license (hopefully, free licensing)? In practice I would be interested in how other developers of Wikimedia tools can potentially benefit. We might want to make sure such licensing is compatible with terms of service of proprietary services (like ChatGPT) that might be used here. --whym (talk) 10:20, 1 June 2024 (UTC)

@Whym Hi, please see the question "Can I use output from ChatGPT for commercial uses?" at https://help.openai.com/en/articles/6783457-what-is-chatgpt, which states: "Subject to the Content Policy and Terms, you own the output you create with ChatGPT, including the right to reprint, sell, and merchandise – regardless of whether output was generated through a free or paid plan." For this project, I am sure that the datasets and models will be freely and publicly available (CC BY 4.0), but I still have to consider the licensing of the code. Thank you for your good question. A l p h a m a ^Talk 03:08, 3 June 2024 (UTC)