Grants talk:Project/Diegodlh/Web2Cit: Visual Editor for Citoid Web Translators

(See User talk:Diegodlh/VisualCitoidTranslatorEditor for discussions previous to the submission of this proposal)

Translator storage and collaboration

One point on which we would really appreciate feedback and comments is translator storage and collaboration.

As described in the Frontend and Backend subsections of the Development section, right now the idea is to have two complementary storage locations.

On the one hand, a central database that holds metadata about the community-contributed translators. Specifically:

  • UUID: translator's unique identifier (just like any official translator)
  • URL matching pattern: to quickly know which translators should be used for a given URL, without having to download the actual translator code
  • path: to the translator User Script file (see below)
  • checksum: last known checksum of the translator file
  • test results: results from last translator test run
  • votes: up and down votes issued for the translator
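As a rough illustration of the record listed above, the following Python sketch models one row of the central database and looks up the translators matching a URL. The class and function names, the use of a regular expression as the matching pattern, and the net-vote ranking are all assumptions made for this example; the proposal does not specify them.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class TranslatorRecord:
    """One row of the (hypothetical) central metadata database."""
    uuid: str          # translator's unique identifier
    url_pattern: str   # URL matching pattern (assumed here to be a regex)
    path: str          # location of the translator User Script file
    checksum: str      # last known checksum of the translator file
    test_results: dict = field(default_factory=dict)  # last test run
    upvotes: int = 0
    downvotes: int = 0

    @property
    def score(self) -> int:
        # Net votes; the actual ranking rule is not defined in the proposal.
        return self.upvotes - self.downvotes


def match_translators(url: str, records: list[TranslatorRecord]) -> list[TranslatorRecord]:
    """Return the records whose pattern matches the URL, best-voted first.

    Only the metadata is consulted, so no translator code is downloaded.
    """
    matches = [r for r in records if re.search(r.url_pattern, url)]
    return sorted(matches, key=lambda r: r.score, reverse=True)


records = [
    TranslatorRecord("uuid-a", r"^https?://news\.example\.org/",
                     "User:Alice/news.js", "abc", upvotes=3, downvotes=1),
    TranslatorRecord("uuid-b", r"^https?://news\.example\.org/",
                     "User:Bob/news.js", "def", upvotes=5),
    TranslatorRecord("uuid-c", r"^https?://blog\.example\.com/",
                     "User:Carol/blog.js", "123"),
]
ranked = match_translators("https://news.example.org/2021/03/story", records)
print([r.uuid for r in ranked])  # ['uuid-b', 'uuid-a']
```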

On the other hand, the translator itself. This can be represented in two equivalent ways: (1) the translator definition (in a custom XML or JSON format), as built by the user using the frontend editor; and (2) the JavaScript version of the translator, as understood by the Zotero translation server. Both representations would be saved in a single file: a JavaScript translator with the XML/JSON definition in a comment at the top.
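To make the single-file idea more concrete, here is a minimal Python sketch that writes a JSON definition into a comment at the top of a JavaScript file, reads it back, and computes a checksum of the file. The `web2cit-definition` marker, the shape of the definition, and the choice of SHA-256 are assumptions for illustration only; the proposal does not fix any of them.

```python
import hashlib
import json
import re

# Hypothetical header format: a block comment that starts with a marker
# and contains the JSON definition built in the frontend editor.
HEADER = re.compile(r"\A/\*\s*web2cit-definition\s*(\{.*?\})\s*\*/", re.DOTALL)


def build_translator_file(definition: dict, javascript: str) -> str:
    """Prepend the commented JSON definition to the JavaScript translator."""
    header = "/* web2cit-definition\n" + json.dumps(definition, indent=2) + "\n*/\n"
    return header + javascript


def read_definition(file_text: str) -> dict:
    """Recover the JSON definition from the comment at the top of the file."""
    match = HEADER.match(file_text)
    if match is None:
        raise ValueError("no embedded definition found")
    return json.loads(match.group(1))


def checksum(file_text: str) -> str:
    """Checksum to store in the central database (SHA-256 is an assumption)."""
    return hashlib.sha256(file_text.encode("utf-8")).hexdigest()


definition = {
    "title": {"selector": "h1"},
    "date": {"selector": "meta[name=date]", "attribute": "content"},
}
# Placeholder body; a real translator would be generated from the definition.
javascript = "function detectWeb(doc, url) { return 'webpage'; }\n"

file_text = build_translator_file(definition, javascript)
print(read_definition(file_text) == definition)
print(checksum(file_text))
```

Keeping both representations in one file means a single checksum covers the definition and the generated code, so a mismatch with the stored checksum would signal that either one has changed.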

The idea is to save these translator files as Wikipedia User Scripts. This way, Web2Cit won't have to handle user accounts, file storage and tracking of changes. However, the downsides would be that (1) only users logged in to Wikipedia would be able to create translators, and (2) only the owner of a translator would be able to edit it, with other logged-in users retaining the right to fork it instead. These downsides may limit community collaboration around the translation of a specific site.

A similar topic was raised around User:V111P's WebRef, a similar citation extraction tool from the pre-Citoid era. With WebRef, users created rules to extract citation metadata from different websites, and stored these rules ("webRefSiteData") locally or as User Scripts. The fact that Web2Cit would have a central database, enabling users to choose from one or more community-contributed translators matching a given URL, would partly address a question about WebRef raised some time ago by User:Fuzheado: "how people can collaboratively help build a database of webRefSiteData" (3rd message of this thread). However, collaboration would be limited to being able to use translators created by other users. Alternatively, one could think of having only one translator per URL matching pattern, and having all users edit that translator. However, this may open the door to accidental (or intended) bugs that would break translation for a given source (as also discussed further down in that thread).

I would appreciate any thoughts, comments and ideas from the community about this. Is the proposed approach OK: more than one translator per URL matching pattern, with each logged-in user able to contribute their own translator (possibly forked from another one), and with translators (and the citations generated from them) sorted by user votes? Or should we have only one community translator per URL matching pattern, and have all users (even anonymous ones) edit that translator the way Wikipedia articles are edited, with the risk of the translator accidentally (or purposefully) breaking for everybody? --Diegodlh (talk) 21:55, 17 March 2021 (UTC)

Parallel efforts to improve metadata embedding by websites

Having site-specific translators is a workaround for websites not exposing their metadata appropriately. In addition to collaboratively closing the gap by widening the coverage of these site-specific translators (the goal of this proposal), parallel efforts to make sure websites appropriately expose their metadata are necessary as well. This is in line with Zotero developers' view that adding JSON-LD support to their Embedded Metadata translator is important, as that would make sure websites exposing their metadata using JSON-LD can be understood without having to write a site-specific translator.

Regarding this, I'd like to mention a conversation we had with User:Fuzheado in the Wikicite Telegram chat, so it doesn't get lost there. In 2017 they gathered a few top news reference sites and tested them by hand to see how well they presented their metadata (as seen by Citoid). The idea would be to create a quarterly scorecard report to praise or shame news and other websites that perform well or poorly in this respect. Andrew suggested starting with the list of URLs they used in 2017.

I offered to automate the process as follows, given this list of article URLs: 1) capture a snapshot of how each article looks in a web browser, 2) call the Citoid API to retrieve metadata using all Zotero translators, and render a newspaper-looking image with these data, 3) call the Zotero translation server using just the generic Embedded Metadata translator and do the same as in (2) (*). The outcome would be a set of three-pane images, "(1) this is how you think we see you, (2) this is how we see you (with lots of volunteer help; i.e., site-specific translators), (3) this is how we really see you", that someone with better UX/UI skills than mine could arrange into a polling website where volunteers can say whether the main fields (title, author, date, publisher) match or not.

(*) It is worth noting that the website might be using JSON-LD to expose metadata, which is not supported by Zotero yet, although they would like to support it soon.
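The data behind panes (2) and (3) could be assembled roughly as in the Python sketch below. It only builds the request URL for the Citoid endpoint of the Wikimedia REST API and compares two already-retrieved metadata items; no network call is made. The function names, the field keys taken from each item, and the sample data are assumptions for illustration, and the actual keys returned by Citoid or the translation server may differ.

```python
from urllib.parse import quote

# The four main fields volunteers would be asked about.
MAIN_FIELDS = ("title", "author", "date", "publisher")


def citoid_request_url(article_url: str, wiki: str = "https://en.wikipedia.org") -> str:
    """Build the Citoid REST request URL for an article (step 2)."""
    return f"{wiki}/api/rest_v1/data/citation/mediawiki/{quote(article_url, safe='')}"


def main_fields(item: dict) -> dict:
    """Keep only the main fields; missing ones become None."""
    return {name: item.get(name) for name in MAIN_FIELDS}


def three_pane_record(article_url: str, all_translators_item: dict,
                      embedded_only_item: dict) -> dict:
    """Put the metadata for panes (2) and (3) side by side."""
    pane2 = main_fields(all_translators_item)
    pane3 = main_fields(embedded_only_item)
    return {
        "url": article_url,
        "with_site_specific_translators": pane2,
        "embedded_metadata_only": pane3,
        "fields_that_differ": [f for f in MAIN_FIELDS if pane2[f] != pane3[f]],
    }


# Made-up sample items standing in for the two API responses.
url = "https://example.org/a?b=1"
from_all_translators = {"title": "Story", "author": [["Jane", "Doe"]],
                        "date": "2017-05-01", "publisher": "Example News"}
from_embedded_only = {"title": "Story | Example News", "date": "2017-05-01"}

record = three_pane_record(url, from_all_translators, from_embedded_only)
print(citoid_request_url(url))
print(record["fields_that_differ"])  # ['title', 'author', 'publisher']
```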

--Diegodlh (talk) 22:52, 17 March 2021 (UTC)

Proposal status

Dear @Mjohnson (WMF): we have recently noticed that all software grant proposals were moved to "under review" on April 17, except two, one of which is this proposal we presented with User:Scann. Since all proposals are currently under committee scoring until May 2, we would appreciate it if you could let us know what the status of our proposal is, and whether any actions are pending on our side to have it moved to "under review" as well. Thank you! --Diegodlh (talk) 13:36, 20 April 2021 (UTC)

Just for the record, User:Mjohnson_(WMF) replied on the Projects Grant talk page thread. --Diegodlh (talk) 17:48, 21 April 2021 (UTC)

Eligibility provisionally confirmed, Round 2 2021 - Research and Software proposal

This Project Grants proposal is under review!

We've provisionally confirmed your proposal is eligible for review in Round 2 2021 for Research and Software projects, contingent upon:

  • confirmation that the project will not depend on staff from the Wikimedia Foundation for code review, integration or other technical support during or after the project, unless those staff are part of the Project Team.
  • compliance with our COVID-19 guidelines.

Schedule delay

Please note that due to unexpected delays in the review process, committee scoring will take place from April 17 through May 2, instead of April 9-24, as originally planned.

  • Please watch your talkpage, which will be the primary method of communication about your proposal. We appreciate your timely response to questions and comments posted there.
  • Please refrain from making changes to your proposal during the scoring period, so that all committee members score the same version of your proposal.
  • After the scoring period ends, you are welcome to make further changes to your proposal in response to committee comments.

COVID-19 planning for travel and/or offline events

Proposals that include travel and/or offline events must ensure that all of the following are true:

  • You must review and can comply with the guidelines linked above.
  • If necessary because of COVID-19 safety risks, you must be able to complete the core components of your proposed work plan _without_ offline events or travel.
  • You must be able to postpone any planned offline events or travel until the Wikimedia Foundation’s guidelines allow for them, without significant harm to the goals of your project.
  • You must include a COVID-19 planning section in your activities plan. In this section, you should provide a brief summary of how your project plan will meet COVID-19 guidelines, and how it would impact your project if travel and offline events prove unfeasible throughout the entire life of your project.

Community engagement

We encourage you to make sure that stakeholders, volunteers, and/or communities impacted by your proposed project are aware of your proposal and invite them to give feedback on your talkpage. This is a great way to make sure that you are meeting the needs of the people you plan to work with and it can help you improve your project.

  • If you are applying for funds in a region where there is a Wikimedia Affiliate working, we encourage you to let them know about your project, too.
  • If you are a Wikimedia Affiliate applying for a Project Grant: A special reminder that our guidelines and criteria require you to announce your Project Grant requests on your official user group page on Meta and a local language forum that is recognized by your group, to allow adequate space for objections and support to be voiced.

We look forward to engaging with you in this Round!

Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

Marti (WMF) (talk) 18:14, 26 April 2021 (UTC)

Aggregated feedback from the committee for Web2Cit: Visual Editor for Citoid Web Translators

Scoring rubric and scores:
(A) Impact potential: 8.5
  • Does it have the potential to increase gender diversity in Wikimedia projects, either in terms of content, contributors, or both?
  • Does it have the potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Community engagement: 7.5
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
(C) Ability to execute: 7.8
  • Can the scope be accomplished in the proposed timeframe?
  • Is the budget realistic/efficient?
  • Do the participants have the necessary skills/experience?
(D) Measures of success: 7.8
  • Are there both quantitative and qualitative measures of success?
  • Are they realistic?
  • Can they be measured?
Additional comments from the Committee:
  • The project fits with Wikimedia's strategic priorities. However its long term sustainability will depend on maintenance of the Web2Cit backend and web proxy services, which are not described in detail. I would like the applicant to comment on long term plans for this maintenance.
  • improving quality of the sources referenced and simplifying editing is a big priority. And opening the source code and inviting (and getting positive responses) from volunteers for translations is also helping
  • This project is a step forward in the strategic direction of improving our citation infra-structure, thus connects directly to our strategic priorities (knowledge as service).
  • The approach is innovative. However the project is complex and there are significant risks, which are justified though. There are clear measures of success.
  • I have no idea why no-one proposed the same thing before, but certainly a good idea
  • This is an incremental step in a development path that is already being worked on. There is research popping up connecting Zotero and Wikimedia; there are several lines of development, particularly associated with the Wikicite community. I am unsure if this is the right moment for such a visual interface, or if we should wait more tech stuff to happen before. If the user-friendly interface happens now, how will adjustments be made? I am missing a sense of who will handle maintenance and how?
  • The project can be accomplished in 12 months provided qualified developers are found. The budget is ok. The participants probably have the necessary skills.
  • lots of hours budgeted and I think they would really be needed to implement such a project
  • Stellar team! :)
  • The project supports diversity and has some community support.
  • High support from key Wikimedians.
  • Visual editor needs improvements. Probably into the knowledge as service's direction.
  • I am generally supportive of this project. However, the question of the long term sustainability of the project and its backend should be addressed during the due diligence process.
  • Great project. The only thing I would also ask for is to have project bugs/development being coordinated via Phabricator/Gerrit, so that it is sustainable and open for contributions of others, too
  • This is an amazing project from two very active Wikimedians. To fully support this I would need more clarity on how the dynamics of maintenance and improvements are envisioned. This is particularly important as the tool is building upon processes that are still on the go and that might change a lot. I would be interested in knowing what Liam Wyatt (WMF) thinks about this and the general plan of 'shared citations'.

This proposal has been recommended for due diligence review.

The Project Grants Committee has conducted a preliminary assessment of your proposal and recommended it for due diligence review. This means that a majority of the committee reviewers favorably assessed this proposal and have requested further investigation by Wikimedia Foundation staff.


Next steps:

  1. Aggregated comments from the committee are posted above. Note that these comments may vary, or even contradict each other, since they reflect the conclusions of multiple individual committee members who independently reviewed this proposal. We recommend that you review all the feedback and post any responses, clarifications or questions on this talk page.
  2. Following due diligence review, a final funding decision will be announced on Thursday, May 27, 2021.

Questions? Contact us at projectgrants (_AT_) wikimedia  · org.

Marti (WMF) (talk) 22:17, 7 May 2021 (UTC)

Response to committee feedback

We would like to thank the committee for taking the time to review our proposal. We have noticed that some members raised concerns regarding whether the project can be "sustained, scaled, or adapted elsewhere after the grant ends". We trust the project will not reach a dead end when the grant finishes, and we would like to elaborate on this around the following three points: (1) that the software will be robust enough, with no significant external dependencies, thus requiring minimal maintenance to ensure that it will continue to work after the grant has ended; (2) that the code will be open and welcoming for other contributors; and (3) that the use cases of the tool are unlikely to become obsolete any time soon, since non-structured sources are expected to keep appearing and changing.

Expected maintenance needs. First, concerning actual maintenance needs, we would like to highlight that we expect the project to be relatively robust, as it is planned to be independent from external projects and services, thus requiring minimal maintenance on that side:

  • Web2Cit will help create Zotero translators. The Zotero translation framework will be used by the Web2Cit backend, and is already used by Citoid, Zotero connectors, ZoteroBib, among others. Right now there are more than 500 official Zotero translators. Thus, it would be expected that any change introduced to the Zotero translation framework should be backwards compatible with already existing translators. Anyway, should any such major changes to the framework ever occur, the Web2Cit backend could still continue to use an old version of it.
  • The proposal considers the (optional) possibility that Citoid could use the Web2Cit API as an additional source. Given that this (optional) communication would be one way only, any changes introduced to Citoid will not affect Web2Cit. On the other hand, the proposal does consider using the Citoid API to show official translations for a given URL, but (i) this is just a convenient extra which is not central to Web2Cit functioning, and (ii) the proposal already considers an alternative to relying on the Citoid API by running a full Zotero translation framework (see footnote in the Dashboard section).
  • As described in the proposal, Web2Cit will rely on some already existing Wikimedia services, including: (i) saving community translators as Wikipedia User scripts; (ii) logging users in via OAuth; and (iii) hosting of back-end, web proxy and database at Toolforge. We believe it is safe to trust these services will remain stable and available long enough.
  • Web2Cit frontend will run as a browser extension or bookmarklet. Should it run as a browser extension, it may be impacted by changes introduced to the WebExtensions API. However, as such changes would impact the whole universe of browser extensions, we expect breaking changes to be unlikely. A bookmarklet, on the other hand, would be independent of any such changes.

Development community. Second, as we understand users may demand improvements to the original service, we would like to underscore that we will make sure the project is welcoming to any volunteer developers or institutions who would like to contribute to maintaining and improving it, both before and after the grant has ended. As already stated in our proposal, we will host the code in Gerrit, under a libre software license, and track issues in Phabricator. In addition, we will:

  • Prepare and provide documentation for developers, including:
    • A software architecture overview, to help potential developers understand the different parts of the software and the relations among them, before delving into the specific code implementation, such as the one provided by OpenRefine-Wikibase reconciliation service
    • A codebase orientation guide, to help potential developers understand how the code is organized in the repository.
    • Thoroughly commented code, to help potential developers understand what each part of the specific code implementation does.
    • Contributor guidelines, explaining how to set up a development environment and submit contributions.
  • Provide a testing infrastructure, to promote developer participation, by helping them and potential code reviewers make sure that possible contributions are not introducing bugs.
  • Consider implementing Continuous Integration (CI) workflows, to make sure contributions are easily and rapidly incorporated, including language translations that may be provided via the collaborative translation project to be set up on translatewiki.net.

The project may attract not only the Wikimedia developer community, but also other developer communities, such as Zotero's. The project's communication lead will make sure it is widely communicated within these communities, to reach and involve potential contributors, including volunteer developers. During the communication phase of the project, we will make sure to reach out to allies or potential partners interested in continuing to develop the project. We don't rule out applying for more funding to add new features to the tool if needed, either through Wikimedia grants or through other institutions.

Nonetheless, we understand that bugs needing urgent attention may appear, especially during the first months after the grant has ended, until a developer community has gathered around the project. Hence, the main developer commits to fixing urgent bugs for at least 6 months after the grant has ended.

Long-term usage. Finally, regarding sustainability of the collaborative source metadata translation project (which this tool will contribute to), it tackles a persistent problem that we foresee will keep on happening in the future. This is not a tool designed to solve a local need, but rather a need that may never end, because unfortunately websites are not building their metadata properly. Involving different communities that are passionate about language and knowledge citations is key to the success of this project, and we expect that small language wikis that need to solve this problem will help in maintaining the translators and the community base. We expect a slow but steady adoption process for this tool. We know that Zotero translators are not only used by Citoid but by other Zotero communities, and by other projects. We expect them to see the usefulness of this tool and keep on using it.

We also want to note that the community engagement process is being led by a long-standing community member (User:Scann) who has been around Wikimedia/Creative Commons/open knowledge communities for over 10 years, and she will disseminate the tool and encourage others to use it after the project has ended, the same way she has been doing with other tools that she uses for her Wikipedia editing.

--Diegodlh (talk) 14:47, 11 May 2021 (UTC)

Round 2 2021 decision

Congratulations! Your proposal has been selected for a Project Grant.

The committee has recommended this proposal and WMF has approved funding for the full amount of your request, $55,425.

Comments regarding this decision:
The committee is pleased to fund a tool to address important gaps in citation formatting support, especially for volunteers in regions that are important for knowledge equity.

Because it can be difficult to build effective code of the kind to be used in this tool (i.e. code that generates code), we especially appreciate your efforts to focus on a minimum viable product that is as stable as possible (versus one that might have more features but perhaps be less stable), and to prioritize recruitment of other open source software developers who can help with co-maintenance of the tool over time, should bugs arise after the grant funding has ended. We see this as equally or even more important to creating good documentation, since maintenance of tools is often not as appealing for open source volunteers to prioritize unless they have already become invested in the tool during the building phase. The benefit of prioritizing ways to encourage investment and engagement of open source volunteers will have long term benefits for the tool, even if it may in some ways slow the build process.

Also, because the tool as currently proposed will not be integrated into the Citoid extension, we ask that you come up with a robust awareness plan that goes beyond one time notification at the project end, and considers how you can integrate stable references to the tool in places where there are ongoing chances for discovery. Because of the implicit knowledge equity goals of this tool, it is especially important that you think about how to do this in the regions and contexts that best support those goals. It’s a good idea to start consulting early on with Wikimedians in target user contexts in order to find out their thoughts about the best way to share the tool in those contexts.

NOTE: Funding of any offline activities (e.g. travel and in-person events) is contingent upon compliance with the Wikimedia Foundation's COVID-19 guidelines. We require that you complete the Risk Assessment Tool:

  • 14 days before any travel and/or gathering event
  • 24 hours before any travel and/or gathering event

Offline events may only proceed if the tool results continue to be green or yellow.

Next steps:

  1. You will be contacted to sign a grant agreement and set up a monthly check-in schedule.
  2. Review the information for grantees.
  3. Use the new buttons on your original proposal to create your project pages.
  4. Start work on your project!

Upcoming changes to Wikimedia Foundation Grants

Over the last year, the Wikimedia Foundation has been undergoing a community consultation process to launch a new grants strategy. Our proposed programs are posted on Meta here: Grants Strategy Relaunch 2020-2021. If you have suggestions about how we can improve our programs in the future, you can find information about how to give feedback here: Get involved. We are also currently seeking candidates to serve on regional grants committees and we'd appreciate it if you could help us spread the word to strong candidates--you can find out more here. We will launch our new programs in July 2021. If you are interested in submitting future proposals for funding, stay tuned to learn more about our future programs.


Marti (WMF) (talk) 17:55, 1 June 2021 (UTC)
Congratulations, project proposers! I was delighted to support this project and very happy to be in the loop on its development. Making it easy to create citations will absolutely improve Wikipedia both for the convenience of contributors and for the reliability/verifiability of content for our readers. Kerry Raymond (talk) 22:28, 2 June 2021 (UTC)
Thank you, Kerry!! --Diegodlh (talk) 02:18, 3 June 2021 (UTC)

Related machine learning models?

Love this idea.

Would users of Web2Cit see a first-pass guess at what a translator for a given site should look like?

In addition to individual translators, is anyone in this space exploring ML models that could guess at a translator and suggest it to someone using Web2Cit? –SJ talk  17:40, 29 July 2021 (UTC)

@Sj:, thanks for your interest! We haven't yet fully decided how collaborative website metadata translation will work with Web2Cit. The original idea is described in the proposal and (briefly) involves a front-end sidebar that would guide the user through (1) selecting the HTML elements where relevant metadata appears (including non-visible elements such as meta tags), (2) applying basic post-processing steps to these elements, and (3) mapping them to each of at least four basic metadata fields: title, authors, date and publisher. This approach may present some difficulties regarding collaboration, as has been raised above on this talk page. For now, multiple community translators may be available for one single URL matching pattern, ranked by user votes.
An alternative to this approach could involve collaborative web page annotation (instead of collaborative translator editing) followed by automatic machine learning translator inference. This is currently being looked into. We will probably bring these alternatives to the first meeting of our Advisory Board for consideration; we have an open call for it right now, by the way, in case you'd like to join. Anyway, we will provide a more detailed explanation of the approach we will use in a few weeks from now, on the Web2Cit page. --Diegodlh (talk) 20:16, 29 July 2021 (UTC)
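For readers wondering what the select, post-process and map steps of the original idea might amount to, here is a minimal Python sketch using only the standard library. The instruction format, the step names and the sample page are hypothetical; they illustrate the three steps and are not the format Web2Cit uses.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Step 1: collect candidate elements (meta tags and <h1> text)."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.values["meta:" + attrs["name"]] = attrs.get("content") or ""
        elif tag == "h1":
            self._in_h1 = True
            self.values.setdefault("h1", "")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.values["h1"] += data


# Step 2: basic post-processing steps (hypothetical names).
STEPS = {
    "strip": str.strip,
    "date_only": lambda value: value[:10],  # keep YYYY-MM-DD of a timestamp
}

# Step 3: map selected elements to the four basic metadata fields.
INSTRUCTIONS = {
    "title": ("h1", ["strip"]),
    "authors": ("meta:author", ["strip"]),
    "date": ("meta:date", ["date_only"]),
    "publisher": ("meta:site_name", ["strip"]),
}


def translate(html: str) -> dict:
    """Apply the instructions to a page and return the metadata fields."""
    collector = ElementCollector()
    collector.feed(html)
    result = {}
    for field_name, (source, steps) in INSTRUCTIONS.items():
        value = collector.values.get(source, "")
        for step in steps:
            value = STEPS[step](value)
        result[field_name] = value
    return result


page = (
    '<html><head><meta name="author" content=" Jane Doe ">'
    '<meta name="date" content="2021-07-29T20:16:00Z">'
    '<meta name="site_name" content="Example News"></head>'
    "<body><h1> A headline </h1></body></html>"
)
metadata = translate(page)
print(metadata)
```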
In the proposal page, does "community translators" refer to human users who are providing human-intelligence based citations (using the frontend as described above), or to software programs like the Zotero translators? AshLin (talk) 05:00, 16 September 2021 (UTC)
Hi, User:AshLin. In the proposal page, "community translators" refers to software translators created by the community. The proposal has been revised and the project now differs in some ways from the original proposal. Mainly, instead of having "translators" created by individual users and voted up and down by the community, there will be a pool of "translation instructions" and another pool of "translation tests" collectively defined and maintained by the community. You can find more info about this in the Resources section of the Web2Cit page in Meta, although some of these resources are not self-explanatory and others are still in draft form. We are currently discussing this at the Web2Cit Advisory Board, and we will probably upload an explanation video soon. Please watch that page to follow the news! Cheers, Diegodlh (talk) 12:54, 16 September 2021 (UTC)

Delays and new midpoint report due date

Due to personal circumstances, the project's Development branch is experiencing delays from approximately September 15 to November 15, 2021. For this reason I have requested that the midpoint report due date be extended until March 15, 2022. The project's timeline and milestones have been updated accordingly. It is worth noting that this delay affects the Development branch alone, and that no budget changes are needed. Thank you for your understanding! --Diegodlh (talk) 19:38, 4 November 2021 (UTC)

Hi @Diegodlh:
Thank you for your message. We understand that unexpected circumstances sometimes make it necessary to change the planned timeline, so we approve the extension of the midpoint report deadline, which was scheduled for February 14, 2022, and which we now expect to be submitted on March 15, 2022. Thank you also for clarifying that the change in the timeline does not affect the budget approved at the start of this grant. Best of luck with the development of the project.
Mercedes Caso (talk) 00:37, 14 January 2022 (UTC)

Research budget update

As we discussed with Mercedes last Wednesday, March 23, and as we anticipated in the Finances section of the midpoint report, as we developed the research subproject we realized that we had underestimated the number of hours needed to carry it out; our updated estimate (88h) is approximately double the original one (44h). Although all parties agreed to the original estimate and we could go ahead without changing anything, as project manager I feel this would be unfair to the members of the research team, with whose work we are extremely satisfied, as we describe in detail in the aforementioned midpoint report. Therefore, we would like to use part of the emergency funds to increase the research budget.

Specifically, we request moving $1320 from the emergency funds (Additional buffer & contingencies costs) to the research funds (Research).

It is worth noting that for now we have no other planned use for these emergency funds. In addition, as we mentioned in the call, part of the funds for transfer costs (Wire fees, which we have only partially used so far) and for workshop logistics (Workshops logistics costs) may be left over.

We would appreciate it if you could consider our request. --Diegodlh (talk) 13:50, 28 March 2022 (UTC)

Hi @Diegodlh,
Thank you very much for your message. As we discussed, we understand that the project component handled by the research team took longer than expected and that you need to use part of the emergency funds to cover that additional cost.
The use of $1,320 USD from the original budget to cover research team costs is approved.
If there are any other questions you would like to resolve, I am available.
Best regards, Mercedes Caso (talk) 20:42, 28 March 2022 (UTC)