Grants:Project/Enrichment of multilingual scientific/technical/medical terms of Wikitionary
What is the problem you're trying to solve?
What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.
The proposed project aims to improve the coverage and quality of Wiktionary for multilingual STM (scientific, technical and medical) terms. STM terms are useful to different communities of users, and are also challenging from a linguistic point of view, since a significant percentage of them are neologisms. I identified this need by attempting to look up in Wiktionary the terms found in samples of STM texts.
- Meyer, Christian M., and Iryna Gurevych. Wiktionary: A new rival for expert-built lexicons? Exploring the possibilities of collaborative lexicography. 2012.
What is your solution?
For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.
Guidelines of this project are:
a) architecture: Wikidata as the architectural backbone. Wikidata will be used as the infrastructure backbone of the project, since it is better suited from the architectural point of view;
b) contents: mutual enrichment of Wiktionary and Wikidata. It is useful to draw on the contents of both Wiktionary and Wikidata, since their coverage of STM terms differs across languages. Moreover, a lexicon like Wiktionary will remain a significant resource for end users for quite a long time, independently of the Wikimedia architecture;
c) evaluation of results on entities extracted from STM documents: a set of STM documents (by topic and by language) will be used (i) to measure the coverage and quality of terms on Wiktionary and Wikidata before and after the research project, and (ii) to compare possible variants and/or tunings of the algorithms in order to identify the most suitable and efficient one;
d) blending automatic and human annotations: the project will use both approaches, since only some specific terms and/or contents can be extended purely automatically. For the rest, the project will provide suggestions to community annotators about terms or contents which should be integrated by the community.
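The before/after measurement in guideline c) could be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's actual evaluation code: the function name `coverage`, the sample terms, and the two lexicon snapshots are all hypothetical, and real term extraction from STM documents is assumed to have happened elsewhere.

```python
def coverage(terms, lexicon):
    """Return the fraction of distinct terms that have an entry in the lexicon.

    Matching is case-insensitive; this is a simplifying assumption.
    """
    distinct = {t.casefold() for t in terms}
    if not distinct:
        return 0.0
    known = {entry.casefold() for entry in lexicon}
    return len(distinct & known) / len(distinct)


# Hypothetical terms extracted from a sample of STM documents.
sample_terms = ["angioplasty", "nanotube", "exosome", "Nanotube"]

# Hypothetical snapshots of the lexicon before and after the project.
lexicon_before = {"angioplasty"}
lexicon_after = {"angioplasty", "nanotube"}

before = coverage(sample_terms, lexicon_before)
after = coverage(sample_terms, lexicon_after)
print(f"coverage before: {before:.2f}, after: {after:.2f}")
```

The same function can be applied per topic and per language, or to the output of different algorithm variants, so that the figures are comparable across runs.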
What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.
The project will significantly extend the quantity and quality of Wiktionary, in terms of the number of terms and of languages supported, in the scientific, technical and medical (STM) sectors:
a) by using information which can be automatically extracted from Wikipedia, when possible;
b) by manually extending Wiktionary in Italian with information which cannot be inferred from Wikipedia;
c) by suggesting to Wikimedia colleagues terms in languages other than Italian which should be manually extended;
d) by disseminating awareness of these new results through an open paper.
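As an illustration of how existing Wikimedia data might feed points a) and c), the sketch below reads the `labels` section of a Wikidata entity in its JSON form (a mapping from language code to an object with `language` and `value` keys) and lists candidate translations for languages not yet covered by a lexicon entry. It is a sketch under assumptions, not the project's method: the function name, the placeholder entity id, and the sample labels are hypothetical, and no network access is performed.

```python
def missing_translations(entity, covered_languages, target_languages):
    """Suggest labels for target languages not yet covered by the entry.

    `entity` follows the shape of Wikidata entity JSON:
    {"labels": {"<lang>": {"language": "<lang>", "value": "<label>"}}}
    """
    labels = entity.get("labels", {})
    suggestions = {}
    for lang in target_languages:
        if lang in covered_languages:
            continue  # the entry already has this language
        label = labels.get(lang)
        if label is not None:
            suggestions[lang] = label["value"]
    return suggestions


# Hypothetical entity; the id is a placeholder and the labels are illustrative.
entity = {
    "id": "Q0",
    "labels": {
        "en": {"language": "en", "value": "angioplasty"},
        "it": {"language": "it", "value": "angioplastica"},
        "fr": {"language": "fr", "value": "angioplastie"},
    },
}

suggestions = missing_translations(
    entity,
    covered_languages={"en"},
    target_languages=["en", "it", "fr", "de"],
)
print(suggestions)
```

Languages for which no label exists (here `de`) produce no suggestion; those are the cases that would be passed on to human contributors.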
How will you know if you have met your goals?
For each of your goals, we’d like you to answer the following questions:
- During your project, what will you do to achieve this goal? (These are your outputs.)
- Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)
For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.
1. More Wiktionary entries (on the order of thousands of new terms) and a general assessment;
2. Wiktionary will improve in quality (more interproject links with Wikipedia and other projects, on the order of hundreds of new entries), in depth (synonyms, acronyms, anagrams, etc.), and in links between more language editions of Wiktionary;
3. One paper (in an open-access journal) to present the results.
During the project
Demo and method
1. Providing a validated demo and method implementing algorithms which can extend the coverage and quality of Wiktionary STM entries in multiple languages (on the order of thousands of new terms), as far as possible "automatically", also taking advantage of other Wikimedia resources, including Wikidata and Wikipedia;
2. Providing a validated demo and method implementing algorithms which can extend the coverage and quality of Wiktionary STM entries in multiple languages (on the order of thousands of new terms), handling efficiently neologisms, which are very frequent in these domains;
3. Providing assessments of the different releases of the prototype solution, before, during and at the end of the project, to identify the most efficient implementation, which will be suggested for deployment after the project. The assessments will use public STM documents as test sets;
4. Providing an assessment of what can be achieved automatically and what has to be achieved by human contributors, and of how this split would affect the process of finding new STM terms. The assessment will use public STM documents as test sets.
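The split described in item 4 could be quantified along these lines. This is a hedged sketch only: `triage`, the sample term sets, and the notion of an "automatically resolvable" term (for example, one for which another Wikimedia resource already holds usable data) are assumptions made for illustration.

```python
def triage(missing_terms, auto_resolvable):
    """Split missing terms into those fillable automatically and those
    that need a human contributor."""
    automatic = sorted(t for t in missing_terms if t in auto_resolvable)
    manual = sorted(t for t in missing_terms if t not in auto_resolvable)
    return automatic, manual


# Hypothetical terms found in test documents but absent from the lexicon.
missing_terms = {"exosome", "nanotube", "optogenetics", "theranostics"}

# Hypothetical subset for which data is available from other resources.
auto_resolvable = {"exosome", "nanotube", "optogenetics"}

automatic, manual = triage(missing_terms, auto_resolvable)
auto_share = len(automatic) / len(missing_terms)
print(f"automatic: {automatic}")
print(f"manual: {manual}")
print(f"share handled automatically: {auto_share:.0%}")
```

The `manual` list is what would be proposed to community contributors, while the share gives a single figure to track across releases of the prototype.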
After the project
1. Enabling third parties, including Wikimedia researchers, to engineer and deploy the final solution prototyped in this project.
Do you have any goals around participation or content?
Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.
Content. One of the goals of the project is to extend the coverage and the quality of STM contents in Wiktionary and, when appropriate, Wikidata. In any case, while the quantity of STM terms may not be so relevant today, their increase in coverage and quality can attract new categories of participants.
Participation. a) Readers. Given the improvements to Wiktionary, I expect that the community of readers will benefit; b) Contributors. Given that the project solution would suggest where to find candidate terms to be added, we could also direct community contributions and discussions.
Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?
The project will start when approved and will last 9 months.
The project includes the following tasks:
a) specifications: detailing the algorithm to be coded and the quantitative and qualitative achievements which will be reported;
b) implementation: implement the process intended to speed up the automatic extension of Wiktionary from Wikipedia;
c) evaluations: analyze and validate the results generated automatically;
d) manual extensions (Italian): manually extend the vocabulary for the Italian language in the case of entries which cannot be extended automatically;
e) community engagement: disseminate results and needs within Wikimedia, with special attention to raising awareness of entries which cannot be extended automatically for languages other than Italian and which are proposed to other Wikimedia tasks and persons;
f) dissemination: produce an openly published scholarly report at the conclusion of the activity;
g) project management.
The project will include the following deliverables:
1. Phase 1. Delivered at month T0+3
1.1. Detailed description of the solution (PROJECT DOCUMENT)
1.2. Implementation of the automatic extension algorithm, which will be demonstrated for manually provided data sets (DEMO)
2. Phase 2. Delivered at month T0+6
2.1. Integration of the automatic extension algorithm with a topic spider (DEMO)
2.2. Release of a first set of Wiktionary extensions produced automatically (DATA)
3. Phase 3. Delivered at month T0+9
3.1. Possible algorithm improvements (DEMO)
3.2. Release of the final set of Wiktionary extensions produced automatically (DATA)
3.3. Paper to disseminate the results achieved and the algorithms implemented (PAPER)
How you will use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!
The proposed expenses are intended to pay for the activity of Marco Ciaramella (myself), for an effort of 6 man-months over an elapsed period of 9 months, at 4K dollars per man-month, for a total amount of 24K $. Marco Ciaramella will be put on partial leave from IntelliSemantic for 60% of his time to follow this project, with a corresponding reduction of his personal salary and of company-related expenses, according to the Italian law 53/2000. The activities of Marco Ciaramella will cover all project tasks already mentioned:
d) manual extensions (Italian)
e) community engagement
g) project management
Marco Ciaramella will produce a one-page monthly report (see task g), summarizing: 1. effort; 2. activities performed; 3. results achieved.
The proposed expenses do not include the activity of Alberto Ciaramella, for a 1 man-month effort, for participation in activities a) and f). Alberto Ciaramella will donate his activity.
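The budget arithmetic above can be checked with a few lines. The figures are the ones stated in this section; the variable names are only illustrative.

```python
# Figures taken from the budget description in this section.
PERSON_MONTHS_FUNDED = 6           # Marco Ciaramella, paid effort
RATE_USD_PER_PERSON_MONTH = 4_000  # 4K dollars per man-month
PERSON_MONTHS_DONATED = 1          # Alberto Ciaramella, volunteer effort

requested_total = PERSON_MONTHS_FUNDED * RATE_USD_PER_PERSON_MONTH
total_effort = PERSON_MONTHS_FUNDED + PERSON_MONTHS_DONATED

print(f"requested amount: {requested_total} USD")
print(f"total effort including donated time: {total_effort} man-months")
```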
Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?
- I will report to the community the sets of terms currently under analysis for addition to Wiktionary and Wikidata;
- I will inform the community of project developments regularly and I will collect feedback from the community about the project.
Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.
As background, my current work in natural language processing and linked data is at a company based in Italy (IntelliSemantic srl https://intellisemantic.com/). In any case, I will take a partially unpaid leave from my company to participate in this project, as already explained in the section "Grant". I participated in the EU-funded project TOPAS (Tool Platform for Patents Analysis and Summarization) and in regionally funded projects. My publications are available at my Google Scholar page. Other than that, I have been active on Wikipedia since the beginning of the Italian chapter.
Other than me, Alberto Ciaramella (see: Scholar page) will participate as a volunteer for some tasks.
Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?
- Updates on the Project page;
- Discussion and updates on WikiResearch mailing list;
Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).
- Strong support Stalinjeet Brar (talk) 06:07, 6 December 2018 (UTC)
- Support --.mau. ✉ 13:13, 16 January 2019 (UTC)
- Support --Lalupa, January 20, 2019
- Support I posted some comments on the talk page. This is a complicated project and I have my doubts that Wikimedia projects have the present technology and social infrastructure to do this. However, I believe in the importance of these outcomes, and would like an early experiment to publish this content. I develop medical content in Wikimedia projects and I myself require the sort of word list that this project proposes to create. Blue Rasberry (talk) 14:48, 22 January 2019 (UTC)
- Support good idea/effort Ozzie10aaaa (talk) 11:34, 25 January 2019 (UTC)
- Support A really good idea, I'm really interested in the outcome. Sannita - not just another it.wiki sysop 15:07, 25 January 2019 (UTC)
- Support --Mozucat (talk) 16:54, 25 January 2019 (UTC)
- Oppose I fail to see (1) how this grant justifies the cost, and (2) why this can't be conducted as a volunteer activity and (3) why this particular editor should receive funding above any other editor. This activity is one that volunteers with bots could readily do and I fail to see why we are paying someone for it. Oppose. Tom (LT) (talk) 00:13, 26 January 2019 (UTC)
- @Tom (LT): Thank you Tom for your feedback. However, this comment sounds like not being addressing the object of this proposed research project. I think that these doubts could be discussed in the discussion page. Marco Ciaramella (talk) 11:02, 26 January 2019 (UTC)