Grants:Project/Diegodlh/Web2Cit: Visual Editor for Citoid Web Translators/Timeline

From Meta, a Wikimedia project coordination wiki


Timeline for Diegodlh

Timeline | Date
Development milestone 1: Publish Web2Cit mockup (DONE) | 27 August 2021
Development milestone 2: Release Web2Cit translation engine (now "core library") v1 (DONE) | 14 February 2022
Development milestone 3: Release Web2Cit translation API (now "translation server") v1 (DONE) | 14 February 2022
Development milestone 4: Release Web2Cit frontend (aka "editor") v1 (DONE) | 17 June 2022
Development milestone 5: Release Web2Cit translation cache (now "monitor") v1 | 30 September 2022
Development milestone 6 (optional): Release Wikipedia User script (DONE) | 17 June 2022
Development milestone 7: Publish development documentation | 30 September 2022
Research milestone 1: Publish up-to-date Citoid coverage gap data & report | 30 September 2022
Research milestone 2: Release Citoid coverage gap estimator script | 31 July 2022


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

Month 1 (July)

  • Started meetings to organize the project between Diego & Scann.
  • Contacted potential researchers for the research side of Web2Cit, which will help establish the baseline needed to understand the project's impact.
  • Issued a call for members for the Advisory Board (see Call for members). The call was shared through several communication channels, including:
    • personal social media (our Twitter accounts)
    • outreach to specific people previously identified with Diego
    • Village Pump
    • Café (Spanish Wikipedia)
    • Mailing lists
  • Started organizing the Meta page for the project.

Month 2 (August)

  • Communications & Community:
    • Closed Advisory Board call for members. Opened Board's mailing list and collected time availability for meetings.
    • Called first two meetings of the Advisory Board: September 14th (non-technical profiles) & September 15th (technical profiles).
  • Development:
  • Research:
    • Confirmed the members of the research group.
    • Had initial team meetings and agreed on the research subproject's deliverables (in Spanish).
    • Familiarized ourselves with Citation Templates and defined an initial corpus of Wikipedia articles.
    • Started development of the script to automatically evaluate the Citoid coverage gap.

Month 3 (September)

  • Communications & Community:
    • Held first two meetings of the Advisory Board: non-technical and technical meetings.
    • Active discussions in the Board's mailing list (27 posts by 5 participants in the last 30 days).
    • We created a pre-recorded video to present Web2Cit in Spanish and in English at WikiConference North America 2021.
  • Development:
    • Web2Cit umbrella project tag was requested in Phabricator.
  • Research:
    • Started discussing and considering limitations and alternatives concerning methodological assumptions within the team and with the Advisory Board.
    • Started exploring tools and strategies to extract metadata from citation templates.

Month 4 (October)

  • Communications & Community:
  • Project management:
    • Web2Cit umbrella project tag was created in Phabricator to keep track of open tasks.
  • Research:
    • Created git repository, including work-in-progress Jupyter Notebook: https://github.com/hdcaicyt/Web2Cit-research
    • Developed script to (1) fetch wikitext from a list of Wikipedia articles and (2) extract citation metadata from their citation templates.
    • Ongoing discussions concerning alternatives to get accurate citation metadata from Wikipedia articles.
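
The extraction step described under Research above (taking wikitext and pulling citation metadata out of its citation templates) can be illustrated with a minimal sketch. This is not the project's script: the template names in `CITATION_TEMPLATES`, the `extract_citations` helper, and the sample wikitext are all assumptions made for illustration, and the regular expression deliberately ignores nested templates and piped wiki links, which a real parser would need to handle.

```python
import re

# Hypothetical subset of citation template names; the project's real list
# is maintained collaboratively and is much longer.
CITATION_TEMPLATES = {"cite web", "cite news", "cita web"}

# Matches a template that contains no nested templates: {{name|...}}
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def extract_citations(wikitext):
    """Return one dict of named parameters per recognized citation template."""
    citations = []
    for name, body in TEMPLATE_RE.findall(wikitext):
        if name.strip().lower() not in CITATION_TEMPLATES:
            continue
        params = {}
        for part in body.split("|"):
            key, sep, value = part.partition("=")
            if sep:  # skip positional (unnamed) parameters
                params[key.strip()] = value.strip()
        citations.append(params)
    return citations


# Made-up wikitext; fetching real article text is out of scope here.
sample = (
    "Text.<ref>{{cite web |url=https://example.org/a |title=An example "
    "|date=2021-08-27}}</ref> More text {{infobox|x=1}}."
)
refs = extract_citations(sample)
```

Here `refs` holds a single dictionary with the `url`, `title`, and `date` parameters of the `cite web` template; the `infobox` template is skipped because it is not in the list.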

Month 5 (November)

  • Communications & Community:
    • Held second Advisory Board meeting on November 16.
  • Project management:
    • Requested changes in project's timeline.
  • Research:
    • Presented research subproject to the Advisory Board.
    • Discussed and agreed on strategy to support extracting citation templates from different Wikipedia languages.
    • Discussed and agreed on strategy to improve reliability of extracted data, using highlighted articles.

Month 6 (December)

Month 7 (January)

  • Communications & Community:
    • Published mockup presentation video, available at https://meta.wikimedia.org/wiki/Web2Cit
    • Held an Advisory Board meeting. From now on, Advisory Board meetings will be published on YouTube (unlisted) to document some software development decisions in a format that is easier to follow than written documentation. We're preparing the video of January's meeting.
  • Development:
    • Created Wikimedia Gitlab source code repositories for Web2Cit Core and Web2Cit Server components.
    • Created the web2cit tool account in Toolforge, which probably will serve (at least) the translation API from https://web2cit.toolforge.org/, and the static frontend from https://static.wmflabs.org/web2cit/
    • Web2Cit Core library development
      • initial HTTP and Citoid caching of target URLs
      • basic selection steps, including Citoid and XPath selections.
      • basic transformation steps, including Join, Split, Date, and Range transformations.
      • translation procedures support (i.e., sequences of multiple selection and transformation steps)
  • Research:
    • Created a map from Zotero (i.e., Citoid) fields to Web2Cit fields that will allow us to compare the Citoid response with the citation metadata collected from Featured articles.
    • Moved the automated script to Wikimedia PAWS for better performance, with time improvements ranging between 67% and 99%.
    • Downloaded wikitext of 10.5k featured articles from 4 language Wikipedias.
    • Managed to extract >450k references with URL from 94% of these articles (~50 references per article).
    • Fetched Citoid response for first ~10k URLs collected. Optimizations pending.
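
The idea of a translation procedure mentioned under Development above (a sequence of selection and transformation steps) can be sketched roughly as follows. This is a simplified illustration and not the Web2Cit Core API: the function names, the exact behavior of each step, and the made-up Citoid-like response are assumptions. The one thing taken from the description is that each step receives the previous step's output.

```python
from typing import Callable, Dict, List

# A step maps a list of strings to a list of strings.
Step = Callable[[List[str]], List[str]]


def citoid_selection(field: str, response: Dict[str, List[str]]) -> Step:
    """Selection step: ignore the input and pick a field from a Citoid-like response."""
    return lambda _values: list(response.get(field, []))


def split(separator: str) -> Step:
    """Transformation step: split every input string on the separator."""
    return lambda values: [p.strip() for v in values for p in v.split(separator)]


def join(separator: str) -> Step:
    """Transformation step: join all input strings into a single string."""
    return lambda values: [separator.join(values)]


def take_range(start: int, end: int) -> Step:
    """Transformation step: keep items start..end (inclusive, zero-based)."""
    return lambda values: values[start : end + 1]


def run_procedure(steps: List[Step]) -> List[str]:
    """Run steps in order, feeding each step's output to the next."""
    values: List[str] = []
    for step in steps:
        values = step(values)
    return values


# Made-up response for illustration; a real one would come from Citoid.
fake_response = {"title": ["An example headline | Example News | World"]}

title = run_procedure([
    citoid_selection("title", fake_response),
    split("|"),
    take_range(0, 0),
])
```

In this example the procedure selects the raw title, splits it on the pipe character, and keeps only the first part, so `title` ends up as a one-item list containing the headline without the site name.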

Month 8 (February)

  • Communications & Community:
  • Development:
    • Completed development of our core library's initial version (development milestone 2). Briefly:
      • Integration of translation procedures into template fields, including procedure output validation.
      • Definition of Domain Configuration class, including methods to fetch and manage collaboratively-defined configuration revisions from our repository in Meta.
        • Definition of Template Configuration subclass, including method to translate a web target with a series of translation templates.
        • Definition of Pattern Configuration subclass, including method to sort URLs into URL path pattern groups.
      • Integration of all submodules into top-level Domain class
    • Initial translation server made available at https://web2cit.toolforge.org (development milestone 3, switched with development milestone 4 to allow for earlier testing), exposing the current capabilities of the core library. This includes:
      • target translation using manually defined translation templates and URL path patterns as described in our resources for early adopters;
      • translation results are included as embedded metadata ready to use by Wikipedia's automatic citation generator.
  • Research:
    • Optimization of high-volume requests to Citoid, including
    • Continued improvement of our collaborative citation template list, including addition of several parameter aliases.
    • Estimation of proportion of references excluded from our analysis; that is, inserted without using citation templates.
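
Sorting URLs into URL path pattern groups, as mentioned for the Pattern Configuration subclass above, could look something like the sketch below. It is only an assumption of how such sorting might work: the `sort_into_groups` helper, the first-match-wins rule, and the catch-all `**` fallback group are illustrative choices, and Python's `fnmatch` is used as a stand-in matcher (its `*` also matches `/`, unlike stricter glob implementations).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def sort_into_groups(urls, patterns, fallback="**"):
    """Assign each URL to the first pattern its path matches, else to the fallback."""
    groups = {pattern: [] for pattern in patterns}
    groups[fallback] = []
    for url in urls:
        path = urlsplit(url).path or "/"
        for pattern in patterns:
            if fnmatchcase(path, pattern):
                groups[pattern].append(url)
                break
        else:
            # No pattern matched: the URL goes to the catch-all group.
            groups[fallback].append(url)
    return groups


# Hypothetical URLs and patterns for a single domain.
groups = sort_into_groups(
    [
        "https://example.org/news/2022/item",
        "https://example.org/blog/post",
        "https://example.org/about",
    ],
    ["/news/*", "/blog/*"],
)
```

Grouping by path pattern lets different translation templates be tried for different sections of the same site, for example one for news items and another for blog posts.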

Month 9 (March)

  • Communications & Community:
  • Development:
    • Recorded Web2Cit core library architecture meeting as video documentation, including overview UML file.
    • Added selection and transformation step types:
      • Fixed selection
      • Match transformation
    • Implemented the sandbox endpoint of the translation server, and updated the early adopter guidelines accordingly.
  • Project management
    • Wrote and submitted the project's midpoint report.
  • Research:
    • Submitted preliminary results to the WikiWorkshop 2022 (our work got accepted and will be presented on April 25, 2022).
    • We continued working on URL validation to minimize unnecessary requests to Citoid (see T301519).
    • Started planning the details of how we will compare Citoid responses vs (presumably accurate) extracted metadata, including
      • mapping fields in Citoid responses to Web2Cit fields of interest, and
      • comparison strategies for fields where:
        • both extracted metadata and Citoid response are arrays of strings, or
        • extracted metadata is an array of strings, and Citoid response is a single string.
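
The two comparison cases listed above can be made concrete with a small sketch. The scoring rules here are assumptions chosen for illustration (normalized exact matching for list versus list, substring matching for list versus single string) and are not necessarily the strategies the research team settled on.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def score_list_vs_list(extracted, citoid):
    """Fraction of extracted values that also appear among the Citoid values."""
    if not extracted:
        return 0.0
    citoid_set = {normalize(value) for value in citoid}
    hits = sum(1 for value in extracted if normalize(value) in citoid_set)
    return hits / len(extracted)


def score_list_vs_string(extracted, citoid):
    """Fraction of extracted values that occur inside a single Citoid string."""
    if not extracted:
        return 0.0
    haystack = normalize(citoid)
    hits = sum(1 for value in extracted if normalize(value) in haystack)
    return hits / len(extracted)


# Made-up author values for illustration.
authors = ["Jane Doe", "John Roe"]
list_score = score_list_vs_list(authors, ["jane doe", "Someone Else"])
string_score = score_list_vs_string(authors, "Jane Doe and John  Roe")
```

In this example `list_score` is 0.5, because only one of the two extracted authors is among the Citoid values, and `string_score` is 1.0, because both names occur in the single string once whitespace and case are normalized.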

Month 10 (April)

  • Communications & Community:
  • Development:
    • Implemented the debug endpoint of the translation server, and updated the early adopter guidelines accordingly.
    • Made improvements to the translation server:
      • General layout improvements, including preparing the results page to show translation test results (i.e., supporting multiple translation targets, and expected outputs and test scores).
      • An improved debug information section, with a hopefully clearer table format.
      • Internationalization and Spanish translation.
      • Added a home page, with a search field to enter a target URL, and basic Web2Cit translation options.
      • Added JSON editor pages, with JSON editor forms embedded, to simplify editing of configuration files.
    • Considerably improved our JSON schemas to simplify editing JSON configuration files for early adopters using forms generated automatically by json-editor.
    • Created a user script to more seamlessly integrate Web2Cit into Wikipedia.
  • Research:
    • Finished making Citoid requests for the 380k+ reference URLs obtained in the extraction phase.
      • We managed to get citation metadata for 75% (280k+) of these URLs.
    • Finished mapping fields in Citoid responses to Web2Cit fields
    • Began cleaning the values obtained in the extraction phase, to continue with the comparison phase.
    • Presented our work at WikiWorkshop 2022, and met with Diego Sáez-Trumper from Wikimedia Research.

Month 11 (May)

  • Communications & Community:
    • We held our first workshop.
      • Around 12 people participated.
      • Recording: https://www.youtube.com/watch?v=wlf3On0YgcI
      • We learned some valuable lessons for our upcoming workshops (see the corresponding Workshop summary section on our Workshops page).
      • At least 2 translation tests and 2 translation templates were created by the participants.
    • Published a video describing what the Web2Cit ecosystem looks like.
    • Held a session at Wikimedia Hackathon 2022. We had around 11 participants, including Citoid's maintainer Marielle Volz.
  • Development:
    • Made several improvements to the configuration file editor to encourage participation until our visual editor is available.
    • Began implementing support for translation tests, both in Web2Cit Core and in Web2Cit server.
    • Started drafting a Web2Cit Monitor redesign using Meta to simplify its implementation (see here).
    • Began development of XPath selection.

Month 12 (June)

  • Communications & Community:
    • Gave a Citoid and Web2Cit workshop at Wikimedia Argentina's and Wikimedistas Uruguay's "Wikiherramientas". YouTube video.
    • Had a conversation with Giovanna Fontenelle. Shared the tool with her and discussed people who may be interested in using it.
  • Development:
  • Research:
    • Added a landing page for the research subproject at our home page.
    • Resumed work on comparing the extracted references with the results returned by Citoid.

Month 13 (July)

  • Communications & Community:
    • Held a remote workshop in Spanish organized by Wikimedia Colombia. See their flyer on Twitter.
  • Development:
    • Dennis Tobar joined the project and started working on the Web2Cit monitor.
    • Continued developing the real-time version of the Web2Cit editor.
    • Published Web2Cit core as npm library, for convenient reuse from Web2Cit server, Web2Cit editor, and any other project which would like to use Web2Cit capabilities.
  • Research:
    • Finished comparing extracted citation metadata vs Citoid responses, reaching the end of the research script writing process.
    • Began writing the final report.

Month 14 (August)

  • Communications & Community:
    • Our 5th Advisory Board meeting was held.
    • Held a Web2Cit workshop in English at the LD4 Wikidata Affinity Group call. Notes and recording available here.
  • Development:
    • At our Advisory Board meeting it was generally agreed that our current visual editor (example) is useful already and that we should down-prioritize development of the new real-time editor to focus on pending tasks of the other components instead:
      • Accordingly, a list of tasks to be resolved before the grant period ends was identified in our Phabricator projects.
      • Nonetheless, a very early first prototype of our real-time editor was published, with viewing and no editing capabilities. More info here.
    • Continued development of the Web2Cit monitor. So far, it:
      • Includes a Python wrapper to communicate with the Web2Cit server.
      • Identifies domains with Web2Cit configuration by searching metawiki.
      • Writes results as wikitext, locally.
  • Research:
    • Started working on a new, relatively small idea: automatically converting our research results into Web2Cit test configuration files.
    • Began revision of the first report draft.
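
As a rough illustration of the monitor step above that writes results as wikitext, the sketch below formats per-path test scores as a wikitable. The `results_to_wikitext` helper, the column names, and the sample scores are hypothetical; the actual layout used by the Web2Cit monitor may differ.

```python
def results_to_wikitext(domain, results):
    """Format (path, score) pairs as a wikitext section containing a wikitable."""
    lines = [f"== {domain} ==", '{| class="wikitable"', "! Path !! Score"]
    for path, score in results:
        lines.append("|-")  # row separator
        lines.append(f"| {path} || {score:.0%}")
    lines.append("|}")  # close the table
    return "\n".join(lines)


# Made-up results for illustration.
wikitext = results_to_wikitext(
    "example.org",
    [("/news/item", 0.75), ("/blog/post", 1.0)],
)
```

The resulting string could be saved to a local file, as the monitor did at this stage, or later posted to a wiki page.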

Month 15 (September)

  • Communications & Community:
    • Completely restructured our homepage so that it is useful for potential users wanting to know what Web2Cit is and how to use it.
    • Restructured and largely developed our documentation pages, which currently hold around 142 kilobytes of information (for reference, the Argentina article on the Spanish Wikipedia is 113 kilobytes long).
    • Updated Web2Cit tools' information automatically created on Toolhub and created a list to gather them all, for easier discoverability.
    • Web2Cit is now available for collaborative (language) translation on translatewiki.net. For now, only the Web2Cit server interface is available for translation, not including the JSON editor. Enabling translation for the Web2Cit JSON editor is planned for both its interface and contents, and will be available under the same Web2Cit translatewiki.net project. In the meantime, automatic translation provided by some web browsers may be used for these.
  • Development:
    • Normalized README files on all software repositories and pointed to the corresponding detailed on-wiki documentation (above). See for example the w2c-core's README, here.
    • Mirrored all source code repositories to Github for improved discoverability, here.
    • Added JSON-LD selection, which we expect will make Web2Cit much more useful, since this popular way of embedding citation metadata is not supported yet by Citoid/Zotero.
    • Continued development of the Web2Cit monitor, including:
      • Test results, logs and summaries are now automatically written to Meta-Wiki.
      • Check queue generation.
      • Started development of queue consumption. This will be included in the final report.
  • Research:
    • Posted research report draft to Meta-Wiki, here. Final version will be ready for project's final report.
    • Continued working on the automatic generation of tests.json config files. A summary of the achievements will be included in the final report.
  • Project Management:
    • Created a separate Web2Cit-JSON-editor project in Phabricator to differentiate issues pertaining to Web2Cit's JSON editor from Web2Cit server issues (w2c-server) and from those concerning the Web2Cit integrated editor (w2c-editor).
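
To give an idea of what the JSON-LD selection mentioned under Development above works with, here is a minimal sketch that collects JSON-LD blocks embedded in a page and reads one top-level property. It is not Web2Cit's implementation: the `JsonLdCollector` class, the `select_jsonld` helper, and the simple top-level key lookup are assumptions for illustration, and the sample page is made up.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks


def select_jsonld(html, key):
    """Return the values of a top-level key across all JSON-LD objects in the page."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return [
        str(doc[key])
        for doc in collector.documents
        if isinstance(doc, dict) and key in doc
    ]


# Made-up page for illustration.
page = """<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "An example headline", "datePublished": "2022-09-30"}
</script></head><body></body></html>"""

headline = select_jsonld(page, "headline")
```

Here `headline` is a one-item list with the article's headline. Real pages often nest the interesting values more deeply (for example inside `@graph`), so a practical selector would need a query syntax richer than a single key.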
