User talk:Diegodlh/VisualCitoidTranslatorEditor

From Meta, a Wikimedia project coordination wiki

Citoid support

Before starting to write a proposal, we would really appreciate User:Mvolz_(WMF)'s (Citoid) feedback (copying this over from Shared Citations proposal talk page).

For example, I think users may be unwilling to use the extension if they have to wait until their new translators are (1) pulled into Zotero's translators repo, and (2) incorporated into Wikimedia's mirror.

Maybe the browser extension could run a Zotero translator server itself to extract citation metadata with the user's translators (until they are used by Citoid) and output a raw citation template (e.g., the Cite web template) that the user can copy and paste. For that, VisualEditor's Citation tool would have to be changed to accept raw citation templates as well. What do you think?

Alternatively, I was thinking of having the Citoid API accept a custom translator provided by the user, but I'm not sure if the translator server is prepared (security-wise) to run user-provided JS code. --Diegodlh (talk) 00:22, 9 March 2021 (UTC)

@Czar: @Mvolz (WMF): @Whatamidoing (WMF): @Ainali: Zotero developers have shown some skepticism that such a visual editor would result in useful translators. Although I agree with some of their concerns, I disagree with others (see summary of our discussion in the Zotero section below).
Given this situation, I would especially appreciate the Citoid team's feedback to decide whether or not to apply with this idea for these software grants closing on Tuesday next week.
Particularly, I would appreciate your comments regarding the following two points:
(1) Do you think a visual editor, inspired by tools such as AnyPicker and opensource Portia, may produce acceptable Citoid web translators?
(2) If yes, given Zotero's hesitance about pulling these community translators into their repository, would you support one or more of the following?:
(a) have the Citoid service use a separate repository, including both Zotero and community translators. I thought it was using Wikimedia's fork, but I see it hasn't been updated since 2018.
(b) have the Citoid API accept a custom translator to be used by the translator server to resolve the URL provided. That is, if a user wants the Citoid API to resolve a URL using a translator they built using the visual editor, they would send the API both the URL they want to resolve and the translator they created.
(c) add a fourth tab to the VisualEditor Citation tool that would take a raw citation template. This way, the browser extension can run a translator server on its own, use community translators to resolve a URL, and output a raw citation template that the user can copy/paste into the Citation tool.
Thank you! --Diegodlh (talk) 19:07, 11 March 2021 (UTC)
Regarding maintainability, my suspicion is that when Wikipedia editors know that a source yields nice results in Citoid, this source will be used more often (if there is a choice between similar sources). Hence, more eyeballs will be on its results, and problems are more likely to be reported if they arise in the future. So even if the actual update is harder, this might be mitigated by the extra attention directed to it. Ainali (talk · contributions) 18:18, 12 March 2021 (UTC)
Have you tried pasting wikitext into VisualEditor's "Basic form" (Cite > Manual > Basic form)? I think it already does what you want. Whatamidoing (WMF) (talk) 17:40, 15 March 2021 (UTC)
Unfortunately, having a fork mostly creates more problems than it solves. Any new translators would still have to pass code review, as we cannot deploy code in production without code review for security reasons. Zotero's team is small, but still overwhelmingly more qualified to provide this review; we don't have anyone at the Foundation who would be able to provide competent review at present. (The translator fork is now archived, thanks for noticing this; it only ever held very minor changes, no new translators.) Plus, if any new translators are good enough to go in our fork, they'd certainly benefit the Zotero project too, so we'd want them upstream anyway, as that is more maintainable. Mvolz (WMF) (talk) 09:23, 16 March 2021 (UTC)
@Ainali: @Whatamidoing (WMF): @Mvolz (WMF): Thank you all for your feedback! User:Scann and I have continued refining the idea and we posted a proposal for a software grant here today. We would appreciate your comments and thoughts in the discussion page, as well as your endorsements if you would like to support it. Thank you! --Diegodlh (talk) 21:49, 16 March 2021 (UTC)

Zotero support

(copying this from the Shared Citations talk page)

I have had a thorough discussion about this idea with Zotero developers Dan Stillman and Sebastian Karcher on the Zotero forums. Their main concerns are around the quality and maintainability of translators created with a visual tool aimed at non-technical users. Although I agree that "metadata quality and saving reliability does matter when generating citations, so bad translators can sometimes be worse than non-existent ones", I also agree that there might be "a perfect-is-the-enemy-of-the-good argument to be made".

Regarding their suggestion that it may be better to focus on improving embedded metadata (e.g., JSON-LD) support instead, although I agree this would increase translator coverage, it would still leave websites which do not embed metadata (or which embed wrong metadata) out of the picture.

See the forum thread for the full discussion.

--Diegodlh (talk) 00:25, 9 March 2021 (UTC)

Statement of PerfektesChaos

  1. It is nice that you are attempting this.
  2. About myself:
    • I have been running a source text gadget for Citoid since 2015: citoidWikitext
    • It has an interface to postprocess some Zotero results in general: citoidWikitext/opus (JS)
    • It has an interface to postprocess Zotero results per project (e.g. dewiki).
    • Interfaces are necessary to adapt immediately to current needs, not waiting until upstream might arrive half a year later, if ever. Furthermore, the site interface enables particular local templates to be used as desired; e.g. w:de:Template:Der Spiegel. That is a site-specific translator as mentioned.
    • It is frequently used at least within German Wikipedia.
  3. I am very much in doubt that this idea could work.
    • Reason: Most things are conditional.
    • That is to say: many decisions have to be made, each followed by its own branch. IF this THEN format that ELSE IF url containing those THEN do something different ELSE IF url containing those THEN do something else.
    • It is not a trivial mapping of constant field values, nor of regular expressions. Many fields and components of the answer influence the final interpretation.
    • I cannot imagine any visual scraper editor creating translators between a website response and a neutral Zotero record.
    • Nor do I believe that anybody without general technical skills and deep understanding of bibliographic records and knowledge of Zotero methodology and wiki template programming would be able to use such an editor, even if provided.
    • Nor do I think that site-specific translators are made by non-technical people. However, a mapping from Zotero record to particular content dependant template might be feasible, triggered by certain characteristic strings in Zotero fields and transforming values and mapping into specific template for that purpose. Anyway, this is a kind of programming, and a set of functions will be required, e.g. for adapting the formatting of numbers, dates, and names of individuals.
  4. BTW, PDF files do contain metadata as well, even if under security restrictions and with encrypted content, and expose metadata unencrypted as XML. That may be exploited and translated into a Zotero record.
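
To make the "most things are conditional" point in item 3 concrete, here is a minimal JavaScript sketch of what such branching can look like when refining a Zotero-style record. Everything in it is an illustrative assumption: the function name `refineRecord`, the URL patterns, and the chosen item types are hypothetical and are not taken from citoidWikitext or from any existing Zotero translator.

```javascript
// Hypothetical sketch: the rules below are made up for illustration only.
function refineRecord(record) {
  var out = Object.assign({}, record);           // work on a copy
  var url = typeof out.url === "string" ? out.url : "";
  var m;

  if (/\/print\//.test(url)) {
    // Branch 1 (assumed): print-edition URL, title ends in "issue/year".
    m = /^(.+?)\s+(\d{1,2})\/((?:19|20)\d\d)$/.exec(out.title || "");
    if (m) {
      out.title = m[1];
      out.issue = m[2];
      out.year = m[3];
    }
    out.itemType = "magazineArticle";
  } else if (/\/video\//.test(url)) {
    // Branch 2 (assumed): video pages get a different item type.
    out.itemType = "videoRecording";
  } else if (!out.itemType) {
    // Fallback: only fill in a type when none was detected upstream.
    out.itemType = "webpage";
  }
  return out;
}
```

Even this toy version needs a URL test, a title pattern, and a fallback, which is the kind of decision tree a purely visual point-and-click form would have to express somehow.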

Greetings --PerfektesChaos (talk) 19:34, 9 March 2021 (UTC)

@PerfektesChaos: Thanks for your feedback!
Regarding citoidWikitext, I tried to use it following the Usage guidelines: I didn't find it as a gadget in either the English or German Wikipedia preferences, and adding the suggested code to my common.js on the English Wikipedia did not result in any apparent changes in the source code editor. So I'm not 100% sure what it does, although I think I got an idea from your description. It sounds really useful, especially the post-processing capabilities, to adapt to immediate needs without having to wait for upstream changes. I wonder whether the ProveIt gadget does something similar?
Regarding your comments about the visual Zotero/Citoid translators editor, I do not agree that a "deep understanding of bibliographic records and knowledge of Zotero methodology and wiki template programming" would be needed to use such an editor. Mapping between Zotero translator output and wiki template is already handled by Citoid, and the plugin could make sure that the different Zotero fields are explained in enough detail for users to understand what should go in each of them. Nonetheless, I fear that, as you say, "most things [might be] conditionals (...) with many fields and components of answer influencing the final interpretation". I wonder how existing visual translator tools, such as AnyPicker or the open source Portia, perform.
> However, a mapping from Zotero record to particular content dependant template might be feasible, triggered by certain characteristic strings in Zotero fields and transforming values and mapping into specific template for that purpose.
I didn't understand this. Do you mean expanding what Citoid does (i.e., mapping Zotero translator output to Cite Web, Cite Journal and Cite Book templates) to other templates, such as the Der Spiegel template you mentioned above?
Finally, regarding PDFs, do you mean having Citoid support extracting metadata from PDFs? This sounds interesting. Zotero already does this, using a Zotero web service (at https://services.zotero.org/recognizer/recognize) that gets metadata extracted from the PDF by Zotero and returns bibliographic metadata, further processed (or not) by Zotero translators.
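
As a rough illustration of the PDF metadata idea, the following JavaScript sketch assumes that the XML in question is an XMP packet carrying Dublin Core fields, and that the packet has already been located and read from the file as a string. The function names `readXmpField` and `xmpToRecord`, the simplified creator shape, and the sample packet used to try it out are all hypothetical; a real implementation would use a proper XML parser and handle entity decoding.

```javascript
// Illustrative only: regex-based extraction from an assumed XMP packet string.
function readXmpField(xmp, name) {
  // Find <dc:name> ... </dc:name>, then take the first <rdf:li> inside it.
  var block = new RegExp(
    "<dc:" + name + "\\b[^>]*>([\\s\\S]*?)</dc:" + name + ">"
  ).exec(xmp);
  if (!block) {
    return null;
  }
  var item = /<rdf:li\b[^>]*>([\s\S]*?)<\/rdf:li>/.exec(block[1]);
  return (item ? item[1] : block[1]).trim();
}

function xmpToRecord(xmp) {
  var record = {};
  var title = readXmpField(xmp, "title");
  var creator = readXmpField(xmp, "creator");
  var date = readXmpField(xmp, "date");
  if (title) {
    record.title = title;
  }
  if (creator) {
    // Assumed simplified creator shape; only the first creator is read.
    record.creators = [{ name: creator, creatorType: "author" }];
  }
  if (date) {
    record.date = date;
  }
  return record;
}
```

Whether Citoid could obtain such a packet from a remote PDF at all is a separate question that this sketch does not address.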
--Diegodlh (talk) 16:27, 10 March 2021 (UTC)
  • German Wikipedia discontinued complex site-level gadgets in 2010.
    • Things like those are user scripts.
    • The installation section says how to load scripts which are not defined as gadgets.
    • The result of script activation depends on skin and device and other toolboxes. Only desktop skins have tool collections. If any known toolbox is detected, a link, button or icon is added. Basically in main namespace of own user pages.
  • On 1:1 mapping.
    • That is an easy exercise.
    • It actually just means renaming a keyword, the identifier.
    • The content is kept.
    • Metadata collections like Dublin Core may be mapped 1:1.
    • “Mapping between Zotero translator output and wiki template is already handled by Citoid” – this exercise is a very trivial keyword renaming game.
    • That may be done via that envisioned interactive form.
    • However, there is not much leeway left for simple 1:1 mappings.
  • “Websites which expose metadata appropriately are understood by generic translators. This is often not the case”
    • Things in phab:diffusion/GZTT are mostly not a trivial 1:1 mapping.
    • What you want is to retrieve key data from an arbitrary website which does not expose metadata in a structured manner.
    • That needs real programming; it is not feasible to fill in a simple interactive form and be done.
    • And writing translators to extract metadata from any website requires high technical skills.
    • Look at this one, e.g.: Spiegel Online
      • There is a lot of work to do until the content is analyzed and grabbed, and the simple metadata structure of Zotero has finally been retrieved.
      • You won’t get regular non-technical authors to express all that programming in a simple form with simple rules.
      • Note that this Zotero translator does retrieve url, title, tags, creators, section, attachments, itemType, volume, abstractNote, date.
      • Zotero does not retrieve the issue field.
      • That I do for myself in opus@citoidWikitext by refining the Zotero title field:
// Split a trailing "issue/year" (e.g. "12/2015") off the Zotero title.
re  = /^(.+)\b([0-5]?\d)\/((?:19|20)\d\d)$/;
got = re.exec( answer.title );
if ( got ) {
   answer.issue = got[ 2 ];
   answer.year  = got[ 3 ];
   answer.title = got[ 1 ];
}
  • I don’t think that this is a job for regular Wikipedians without any technical background. You do need knowledge about regular expressions, at least, if you want to generate metadata out of an arbitrary website.
  • And yes, you do need semantic knowledge about HTML elements, and Zotero fields, and bibliographic records, and data formatting.
  • On particular template: I mentioned already w:de:Template:Der Spiegel.
    • That is demanding a
      • numerical article ID,
      • the title of the article,
      • the year,
      • the issue number.
    • Some non-trivial extraction is still needed before template parameters can be filled from standardized Zotero fields. Three fields are trivial 1:1 mappings, but two need further processing:
switch ( e[ 0 ] ) {
   case "ID":
      if ( typeof assembly.url  ===  "string" ) {
         v = assembly.url.replace( /^.+print\/d-(\d+)\.html$/,
                                   "$1" );
      }
      break;
   case "Autor":
      v = WIKI.family( assembly, "authors" );
      if ( ! v   &&
           typeof assembly.year  ===  "number"   &&
           assembly.year < 2000 ) {
         r[ i ] = false;
      }
      break;
   case "Titel":
      v = CITWT.opus.fetch( assembly, "title" );
      break;
   case "Jahr":
      v = CITWT.opus.fetch( assembly, "year" );
      break;
   case "Nr":
      v = CITWT.opus.fetch( assembly, "issue" );
      break;
}   // switch e[ 0 ]
  • I did not get the impression that PDF resources are evaluated on WMF wikis right now. So I added a note.
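
For contrast with the conditional code above, the "1:1 mapping" case really is just key renaming, as this JavaScript sketch shows. The table `DC_TO_ZOTERO` and the helper `renameKeys` are hypothetical names, and the pairs listed are an assumed, illustrative correspondence between Dublin Core elements and Zotero field names, not an official crosswalk.

```javascript
// Assumed correspondence, for illustration only.
var DC_TO_ZOTERO = {
  title: "title",
  date: "date",
  publisher: "publisher",
  language: "language",
  description: "abstractNote",
  rights: "rights"
};

function renameKeys(source, mapping) {
  var target = {};
  Object.keys(mapping).forEach(function (key) {
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      // The content is kept; only the identifier changes.
      target[mapping[key]] = source[key];
    }
  });
  return target;
}
```

A table like this could plausibly be filled in through an interactive form; the switch statement above, with its URL pattern and year-dependent rule, could not be expressed this way.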
Greetings --PerfektesChaos (talk) 13:01, 12 March 2021 (UTC)
@PerfektesChaos: Thank you again for your comments. Based on all the feedback we received, User:Scann and I have continued refining the idea, and we posted a proposal today. We referred to the threads in this discussion page in our proposal, but feel free to engage in further discussion in the proposal's discussion page as well. Thank you! --Diegodlh (talk) 21:54, 16 March 2021 (UTC)

Cool!

Hi, I heard about this idea on Telegram and was immediately intrigued. I am a big fan of VisualEditor, but I sometimes get frustrated by (1) the poor 'recognition' ('rendering'?) of some popular German news sites by Citoid, and (2) the confusion created by Citoid's seemingly random usage of two different citation templates on German Wikipedia (de:Template:Internetquelle and de:Template:Literatur). Happy to talk more about this if you are looking for a motivated, but mostly clueless user to interview. --Gnom (talk) 22:58, 13 March 2021 (UTC)

@Gnom: Thanks for your interest! User:Scann and I have continued developing the idea and we posted a proposal today. We would appreciate your thoughts, comments and questions in the proposal's discussion page, as well as your endorsement if you would like to support it! By the way, we are looking for volunteers to (1) translate the software to languages other than English and Spanish, and (2) participate in a couple of workshops about the tool, in case the proposal is approved. Feel free to add your name there if you wish as well! Thank you, --Diegodlh (talk) 21:59, 16 March 2021 (UTC)
Hi Gnom! It would also be very nice to chat about the problems you are experiencing with sources and help us collect some sources which you have experienced problems with -- we're planning on doing some small sessions on user needs. Thanks for your interest! Scann (talk) 22:13, 16 March 2021 (UTC)