Web2Cit/Docs/Templates

From Meta, a Wikimedia project coordination wiki

Translation templates are based on specific webpages and include a series of translation procedures for each translation field. These translation procedures include a series of selection and transformation steps which specify how to extract values from the template webpage and how to transform them to get valid outputs for the corresponding translation field.

Although translation templates are based on specific webpages, they may be used to translate other webpages from the same domain, according to the applicability criteria outlined below.

Multiple translation templates may be defined for a website/domain. As introduced in the Basics documentation page, when a target webpage is given to Web2Cit for translation, translation templates belonging to the same URL path pattern group as the target webpage are tried one after the other until one that applies is found.

Template definition

The templates configuration file for a website or domain may include one or more translation templates.

Template path

Each translation template is defined based on a specific webpage. This webpage's path is the translation template's path.

If multiple translation templates have been defined for the same or equivalent template paths, Web2Cit core only considers the first one and ignores the others.

Note that query strings are used by Web2Cit (i.e., they are passed to selection steps such as Citoid, XPath, and URL), but fragment identifiers are ignored.
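As a rough illustration of the note above, the following sketch derives the part of a URL that is kept (path and query string) and drops the fragment identifier. The helper name `templatePath` is hypothetical and not part of Web2Cit; it only relies on the standard `URL` class.

```javascript
// Hypothetical helper (not part of Web2Cit): keep the path and the query
// string of a URL, and drop the fragment identifier.
function templatePath(url) {
  const { pathname, search } = new URL(url);
  return pathname + search;
}

console.log(templatePath("https://example.com/news/article?id=42#comments"));
// prints "/news/article?id=42"
```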

Template label

Optionally, translation templates may have a label, which helps Web2Cit collaborators understand why the template was defined.

Template fields

Translation templates include a series of template fields, matching the translation fields supported by Web2Cit. See our Fields documentation for a list of supported template fields and corresponding valid outputs.

Each template field has a validation rule, indicating what translation outputs are valid for that field.

If multiple template fields are defined for the same translation field, Web2Cit core currently considers the first one only.

Field requirement

Template fields may be marked as required or not. A required field must return a valid output for its template to be considered applicable to the target webpage.

Mandatory fields

Mandatory fields are fields that must always be present in a translation template. Translation templates that do not include them will be ignored by Web2Cit. These fields are itemType and title.

Mandatory fields are also always marked as required.

Control field

The control field is a special type of template field that is not included in the template output. However, because the control field is taken into account when deciding whether a template is applicable to a given target webpage, it can be used to control which translation templates are used to translate that webpage in cases where URL path patterns cannot be used.

Translation procedures

Each template field may include one or more translation procedures (typically one). Translation procedures include a series of selection and transformation steps which respectively select elements on the target webpage that contain the relevant metadata, and transform them into a format that is valid for the current template field.

Read the following sections to find out about the selection and transformation steps available and how to use them.
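To make the nesting of templates, fields, procedures and steps easier to picture, here is a sketch of what a template definition might look like. Only the `type`, `config` and `itemwise` parameters are named in this documentation; every other key name (`path`, `label`, `fields`, `fieldname`, `required`, `procedures`, `selections`, `transformations`) and all values are assumptions made for illustration.

```javascript
// Hypothetical template definition. Only "type", "config" and "itemwise"
// are named in this documentation; all other key names are assumptions.
const template = {
  path: "/news/2021/example-article",
  label: "News articles",
  fields: [
    {
      fieldname: "itemType",
      required: true,
      procedures: [
        {
          selections: [{ type: "fixed", config: "newspaperArticle" }],
          transformations: [],
        },
      ],
    },
    {
      fieldname: "title",
      required: true,
      procedures: [
        {
          selections: [{ type: "xpath", config: "//h1" }],
          transformations: [{ type: "range", config: "1", itemwise: false }],
        },
      ],
    },
  ],
};

console.log(template.fields.map((field) => field.fieldname));
```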

Selection steps

Selection steps take the translation target webpage as input, and return a list of zero or more selected values (strings) from that webpage.

There are different selection step types, and their behavior is customized using a config parameter. Selection steps whose configuration value fails validation are ignored.

Citoid selection

The Citoid selection step selects a field from the Citoid response for the translation target.

  • type: citoid
  • config: any valid Citoid/Zotero base field name.[note 1] You can check what Citoid returns for a given URL using the citation endpoint of Wikimedia REST API;[1] make sure you use the mediawiki-basefields format. Note that all creator fields are split into creatorFirst and creatorLast fields. For example, the author field is split into authorFirst and authorLast fields.
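The creator split described above can be sketched as follows. The sample `citation` object and the assumption that creators arrive as `[first, last]` pairs are illustrative only; check the actual Citoid response for your URL. The helper `splitCreators` is hypothetical.

```javascript
// Assumed shape of one citation in mediawiki-basefields format:
// each creator is a [first, last] pair. Sample data, not a real response.
const citation = {
  itemType: "webpage",
  title: "An example page",
  author: [["Ada", "Lovelace"], ["Alan", "Turing"]],
};

// Hypothetical helper: split a creator field into "...First" and "...Last".
function splitCreators(citation, creatorField) {
  const creators = citation[creatorField] ?? [];
  return {
    [`${creatorField}First`]: creators.map(([first]) => first),
    [`${creatorField}Last`]: creators.map(([, last]) => last),
  };
}

console.log(splitCreators(citation, "author"));
// { authorFirst: ["Ada", "Alan"], authorLast: ["Lovelace", "Turing"] }
```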

XPath selection

The XPath selection step selects a node from the translation target's HTML using XPath.

  • type: xpath
  • config: any valid XPath v1.0 expression. You can use your web browser's inspector (shown with F12 in some browsers) to get an XPath expression for an HTML node. Note that more than one XPath may be used for the same node, and some may be more robust than others. In addition, you may use some browser extensions to test your XPath expressions by highlighting all matching nodes.[note 2]

Note that in some webpages content is added on the fly using JavaScript. Web2Cit, like Citoid, does not run JavaScript on webpages. Hence, in these cases, what you see may not be what Web2Cit sees. Consider temporarily disabling JavaScript, for example by using a browser extension such as uBlock Origin.

Fixed selection

The Fixed selection step always returns the same predefined value.

  • type: fixed
  • config: the predefined value to be returned.

JSON-LD selection

The JSON-LD selection step selects one or more elements from JSON-LD objects present in the target webpage.

  • type: json-ld
  • config: any valid JMESPath expression, to be evaluated against an array including all JSON-LD objects found in the target webpage.

To test JMESPath expressions:

  1. Open the target webpage
  2. Use the Copy JSON-LD bookmarklet (see below) to copy an array including all JSON-LD objects available into your clipboard.
  3. On the JMESPath website, paste the array copied above
  4. Try different JMESPath expressions and check the results.[note 3]

Note that JSON-LD selection will always return an array of stringified values, ignoring not found (i.e., null) elements.
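The note above can be pictured with the sketch below, which turns the result of a JMESPath query into a list of strings and skips null elements. How Web2Cit serializes non-string values is not stated here, so the use of `JSON.stringify` is an assumption, and `toSelectionOutput` is a hypothetical name.

```javascript
// Hypothetical post-processing of a JMESPath result: always return an array
// of strings, dropping elements that were not found (null or undefined).
function toSelectionOutput(result) {
  const items = Array.isArray(result) ? result : [result];
  return items
    .filter((item) => item !== null && item !== undefined)
    // Assumption: non-string values are serialized with JSON.stringify.
    .map((item) => (typeof item === "string" ? item : JSON.stringify(item)));
}

console.log(toSelectionOutput(["News", null, 2021]));
// ["News", "2021"]
```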

Copy JSON-LD bookmarklet

The following Copy JSON-LD bookmarklet helps you copy an array including all JSON-LD objects found on a target webpage, as described above:

  1. Add a new bookmark to your browser, preferably to your bookmark toolbar for easier access. You may name it "Copy JSON-LD".
  2. In the bookmark's URL field, type `javascript:` followed by the code below:
```javascript
function concatAndCopy() {
  /* Collect every JSON-LD script element on the page. */
  const nodes = Array.from(
    document.querySelectorAll('script[type="application/ld+json"]')
  );
  let jsonld = [];
  const errors = [];
  nodes.forEach((script, index) => {
    let content = script.textContent;
    if (content !== null) {
      try {
        /* Replace control characters, which would make JSON.parse fail. */
        content = content.replace(/[\x00-\x1F\x7F\x80-\x9F]/g, " ");
        const json = JSON.parse(content);
        /* concat flattens arrays, so single objects and arrays both work. */
        jsonld = jsonld.concat(json);
      } catch {
        errors.push(index);
      }
    }
  });
  navigator.clipboard.writeText(JSON.stringify(jsonld, undefined, 2)).then(
    () => {
      let message = "JSON-LD copied to clipboard!";
      if (errors.length > 0) {
        message += (
          "\n\nCould not parse JSON-LD objects: " +
          `${errors.map((index) => `#${index + 1}`).join(", ")}`
        );
      }
      alert(message);
    }
  );
}
concatAndCopy();
```

URL selection (proposed)

The URL selection is a proposed selection step that would allow selecting components from the target webpage's URL, such as its path or query string parameters. See T304326.

CSS selection (proposed)

The CSS selection is a proposed selection step that would allow selecting nodes from the target webpage's HTML using CSS (instead of XPath) selectors. See T308668.

Header selection (proposed)

The Header selection is a proposed selection step that would allow selecting headers from the target webpage's server response. See T304333.

Text fragment selection (proposed)

See T309658.

Transformation steps

Transformation steps take a list of zero or more values as input; i.e., the output (1) of one or more selection steps, or (2) of another transformation step. They return a transformed list of zero or more values (strings).

There are different transformation step types, and their behavior is customized using (1) a config parameter, and (2) an itemwise parameter, indicating whether the transformation should be applied on a per-input-item basis, or on the entire input as a whole. Transformation steps whose configuration value fails validation are ignored.

Join transformation

The Join transformation step joins two or more items in a list into one, using the separator specified.

  • type: join
  • config: the separator to use
  • itemwise (default = false): if set to true, the transformation is applied to each string in the input list independently, taking each string as a list of characters.
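A minimal sketch of the behavior described above, not Web2Cit's actual implementation; the function name `join` and its signature are assumptions.

```javascript
// Sketch of the Join transformation (hypothetical signature).
function join(input, separator, itemwise = false) {
  if (itemwise) {
    // Each string is treated as a list of characters and joined on its own.
    return input.map((item) => Array.from(item).join(separator));
  }
  // The whole input list is joined into a single item.
  return [input.join(separator)];
}

console.log(join(["Jane", "Doe"], " ")); // ["Jane Doe"]
console.log(join(["ab", "cd"], "-", true)); // ["a-b", "c-d"]
```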

Split transformation

The Split transformation step splits a string at the separator specified into two or more substrings.

  • type: split
  • config: the separator to use
  • itemwise (default = true): if set to false, the input list of strings is first joined into a single string to which the split is applied.
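A minimal sketch of the behavior described above, not Web2Cit's actual implementation. The function name `split` is an assumption, and so is joining the input with an empty string when itemwise is false, since the documentation does not say which delimiter is used.

```javascript
// Sketch of the Split transformation (hypothetical signature).
function split(input, separator, itemwise = true) {
  // Assumption: with itemwise = false, items are joined with an empty string.
  const items = itemwise ? input : [input.join("")];
  return items.flatMap((item) => item.split(separator));
}

console.log(split(["a, b", "c"], ", ")); // ["a", "b", "c"]
console.log(split(["a,", "b"], ",")); // ["a", "", "b"]
console.log(split(["a,", "b"], ",", false)); // ["a", "b"]
```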

Date transformation

The Date transformation step uses the Sugar library[2] to parse natural language dates into a standard YYYY-MM-DD format. If not possible, it returns the original value.

  • type: date
  • config: one of the currently supported locales: ca, da, de, en, es, fi, fr, it, ja, ko, nl, no, pl, pt, ru, sv, zh-CN, zh-TW.
  • itemwise (default = true)

Range transformation

The Range transformation selects one or more items or ranges of items and returns them in the order specified. Numbering is one-based (i.e., the first item is 1).

  • type: range
  • config: one or more ranges, separated by commas. A range can be in one of the following forms:
    • start:end: selects elements from start through end, end included.
    • start:: selects elements from start through the last item in the list.
    • :end: selects elements from the beginning of the list through end, end included.
    • start: selects single element at start index.
  • itemwise (default = false): if set to true, the transformation is applied to each string in the input list independently, taking each string as a list of characters.
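The range forms listed above can be sketched as follows. This is an illustration under stated assumptions, not Web2Cit's implementation: the function name `range` is hypothetical, and in itemwise mode the selected characters are assumed to be joined back into one string per input item.

```javascript
// Sketch of the Range transformation (hypothetical signature).
function range(input, config, itemwise = false) {
  const select = (items) =>
    config.split(",").flatMap((part) => {
      const [startText, endText] = part.trim().split(":");
      const start = startText === "" ? 1 : Number(startText); // ":end"
      let end;
      if (endText === undefined) end = start; // "start": single element
      else if (endText === "") end = items.length; // "start:": to the last item
      else end = Number(endText); // "start:end", end included
      // One-based, end-inclusive range mapped to a zero-based slice.
      return items.slice(start - 1, end);
    });
  if (itemwise) {
    // Assumption: selected characters are joined back into one string.
    return input.map((item) => select(Array.from(item)).join(""));
  }
  return select(input);
}

console.log(range(["a", "b", "c", "d"], "3,1:2")); // ["c", "a", "b"]
console.log(range(["2021-05-01"], "1:4", true)); // ["2021"]
```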

Match transformation

The Match transformation returns one or more substrings matching a target.

  • type: match
  • config: the matching target, expressed as either a plain string or a regular expression (in JavaScript "flavor"). To use a regular expression, wrap your pattern between / characters, followed by any optional flags.[3] For example, /(sub)?string/i matches either "string" or "substring", case-insensitively.[note 4] If you need to match a plain string that would otherwise be interpreted as a regular expression (i.e., a string matching the pattern \/.*\/[a-z]*), express it as a regular expression instead. For example, to match the string /.*/ literally and prevent it from being interpreted as the regular expression .*, express it as the regular expression /\/\.\*\// instead (note the escaped special characters \/, \. and \*).[4]
  • itemwise (default = true)


Each match is returned as a separate output item. For example, matching "a substring inside a string" against the target "string" returns a two-item array output: ["string", "string"], one item for each match. If using capturing groups in regular expressions,[5] only group matches are returned (i.e., not the full match).

If no matches are found for a given input item, the input item is ignored (i.e., not included in the transformation output). For example, matching the two-item array input ["a string with a substring", "a string without it"] against the target "substring" returns a one-item array output: ["substring"].
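The matching rules above can be sketched as follows. This is an illustrative approximation, not Web2Cit's code: the function name `match` is hypothetical, the itemwise parameter is left out, and the global flag is added internally so that every match is found.

```javascript
// Sketch of the Match transformation, applied item by item.
function match(input, target) {
  // A target shaped like /pattern/flags is read as a regular expression.
  const parsed = /^\/(.*)\/([a-z]*)$/s.exec(target);
  const regex = parsed
    ? new RegExp(parsed[1], parsed[2].includes("g") ? parsed[2] : parsed[2] + "g")
    : // Plain string: escape regex special characters and match it literally.
      new RegExp(target.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "g");
  return input.flatMap((item) =>
    Array.from(item.matchAll(regex)).flatMap((found) =>
      found.length > 1
        ? found.slice(1).filter((group) => group !== undefined) // groups only
        : [found[0]] // no capturing groups: the full match
    )
  );
}

console.log(match(["a substring inside a string"], "string"));
// ["string", "string"]
console.log(match(["a string with a substring", "a string without it"], "substring"));
// ["substring"]
console.log(match(["Published: 2021-05-01"], "/(\\d{4})-\\d{2}/"));
// ["2021"]
```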

Replace transformation (proposed)

The Replace transformation is planned to be similar to the Match transformation described above, but with an additional replace parameter specifying what the target matches should be replaced with. See T302691.

In the meantime, you may consider using a Split transformation followed by a Join transformation as a workaround.
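The workaround can be pictured with plain JavaScript string methods standing in for the two steps; the input value is made up for illustration. Note that the Join step merges all input items into one, so this only behaves like a replacement when the input holds a single item.

```javascript
// Replace "/" with "-" by splitting on the target and joining with the
// replacement (plain JavaScript stand-ins for the Split and Join steps).
const input = ["2021/05/01"];
const pieces = input.flatMap((item) => item.split("/")); // Split, itemwise
const output = [pieces.join("-")]; // Join, not itemwise

console.log(output); // ["2021-05-01"]
```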

Case transformation (proposed)

The Case transformation is a proposed transformation step that would transform input strings into lower, UPPER, Sentence or Title Case. See T302692.

Custom transformation (proposed)

The Custom transformation is a proposed transformation step that would apply a custom defined JavaScript function. It has also been proposed that these custom functions may be collaboratively maintained. See T305883 and T305886.

Fallback template

The fallback template is a default translation template that will be used to translate a given target webpage when no applicable templates have been found for it (see the Translation section below).

This fallback template is meant to return a translation as close to the Citoid translation as Web2Cit allows, by using Citoid selection in all its template fields. The fallback template should be applicable for most target webpages, except a few cases (see T305168).

The fallback template is defined in the Web2Cit Core repository as follows:

| Fieldname | Selection (type: config) | Transformation (type: config) | Notes |
|---|---|---|---|
| Item type | Citoid: itemType | - | |
| Title | Citoid: title | - | |
| Author first names | Citoid: authorFirst | - | |
| Author last names | Citoid: authorLast | - | |
| Publication date | Citoid: date | Date: en[note 5] | |
| Published in | 1. Citoid: publicationTitle; 2. Citoid: code; 3. Citoid: reporter | Range: 1[note 5] | Citoid/Zotero includes "container title" metadata in different base fields depending on the source's type.[6] |
| Published by | Citoid: publisher | - | |
| Language | Citoid: language | | |

Note that there is a proposal to change the fallback approach of Web2Cit translation; see T302019.

Translation

Translation templates are based on specific webpages, but they may be used to translate other target webpages from the same domain and URL path pattern group, provided that certain applicability criteria are met.

Briefly, given a target webpage, first a list of candidate templates to try translating it with is set up. Then, each of these templates is tried on the target webpage in order, returning an output for each of its template fields. Finally, translation ends when an applicable template is found, or when no candidate templates are left to try.

Candidate templates

First, a list of translation templates that will be tried on the target webpage is set up:

  1. If a translation template has been defined for the target webpage, it will be tried first.
  2. Next, all translation templates sorted into the same URL path pattern group as the target webpage are tried, in the order in which they have been defined.
  3. Finally, the fallback template.

Then, these templates are tried one by one on the target webpage, as described below.
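The ordering above can be sketched as follows. The function name `candidateTemplates` and the template objects with a `path` key are assumptions made for illustration; `groupTemplates` stands for the templates already sorted into the target's URL path pattern group.

```javascript
// Sketch: build the ordered list of templates to try on a target webpage.
function candidateTemplates(targetPath, groupTemplates, fallbackTemplate) {
  // 1. A template defined for the target webpage itself goes first.
  const own = groupTemplates.filter((t) => t.path === targetPath).slice(0, 1);
  // 2. Then the other templates in the group, in definition order.
  const others = groupTemplates.filter((t) => t.path !== targetPath);
  // 3. Finally, the fallback template.
  return [...own, ...others, fallbackTemplate];
}

const group = [{ path: "/a" }, { path: "/b" }, { path: "/c" }];
console.log(candidateTemplates("/b", group, { path: "fallback" }).map((t) => t.path));
// ["/b", "/a", "/c", "fallback"]
```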

Template field outputs

Translation templates include template fields, which in turn include translation procedures, as described in the Template definition section above. For each of these translation procedures:

  1. First, selection steps are applied on the target webpage. The outputs from all selection steps are concatenated into a single selection output,[note 6] and passed to the first transformation step below.
  2. Transformation steps are applied sequentially, the output of each step being passed to the following step on the list.[note 6] The output of the last transformation step becomes the transformation output, which is the translation procedure's output.

The translation procedure's output is the transformation output, or the selection output in case no transformation steps have been defined.

The outputs from all translation procedures defined for a template field are concatenated in order into a single field output. If the template field corresponds to single-valued translation fields, the output values are joined with a , delimiter.

Finally, the translation field's validation pattern (see the Fields documentation) is used to decide whether the field output is valid or not. All output values must be valid according to this validation pattern for the field output to be valid. Empty outputs ([]) are always invalid.
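The flow described in this section can be sketched as below. It is an illustration, not Web2Cit's implementation: steps are modeled as plain functions, and the names `runProcedure`, `fieldOutput`, `singleValued` and `pattern` are hypothetical. The documentation names a comma as delimiter; the absence of surrounding whitespace is an assumption.

```javascript
// Sketch: selection outputs are concatenated, then transformation steps
// are applied one after the other.
function runProcedure(procedure, target) {
  const selected = procedure.selections.flatMap((select) => select(target));
  return procedure.transformations.reduce(
    (values, transform) => transform(values),
    selected
  );
}

// Sketch: concatenate procedure outputs, join single-valued fields, validate.
function fieldOutput(field, target) {
  let values = field.procedures.flatMap((p) => runProcedure(p, target));
  if (field.singleValued && values.length > 0) values = [values.join(",")];
  // Empty outputs are always invalid; otherwise every value must be valid.
  const valid =
    values.length > 0 && values.every((value) => field.pattern.test(value));
  return { values, valid };
}

const titleField = {
  singleValued: true,
  pattern: /\S/, // hypothetical validation pattern: any non-blank string
  procedures: [
    {
      selections: [(page) => [page.heading], (page) => [page.subheading]],
      transformations: [(values) => values.map((value) => value.trim())],
    },
  ],
};

console.log(fieldOutput(titleField, { heading: " Main title ", subheading: " A subtitle" }));
// { values: ["Main title,A subtitle"], valid: true }
```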

Template applicability

The procedure described above is repeated for all template fields of a translation template. If all template fields marked as required return valid outputs, the translation template is deemed applicable to the target webpage; translation stops there and a citation is returned.
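The applicability check can be sketched as a loop over candidate templates. The names `translate` and `evaluateField`, and the shape of the template and field objects, are hypothetical; `evaluateField` stands for whatever computes a field's output and validity.

```javascript
// Sketch: try templates in order and stop at the first applicable one.
function translate(candidates, evaluateField, target) {
  for (const template of candidates) {
    const outputs = template.fields.map((field) => ({
      field,
      ...evaluateField(field, target),
    }));
    // Applicable when every required field returned a valid output.
    const applicable = outputs.every((o) => o.valid || !o.field.required);
    if (applicable) return { template, outputs };
  }
  return undefined; // no candidate template applies
}

// Stub evaluation for illustration: a field is valid if the target has values.
const evaluateStub = (field, target) => {
  const values = target[field.name] ?? [];
  return { values, valid: values.length > 0 };
};
const candidates = [
  { label: "specific", fields: [{ name: "title", required: true }, { name: "date", required: false }] },
  { label: "fallback", fields: [{ name: "title", required: true }] },
];

console.log(translate(candidates, evaluateStub, { title: ["A title"] }).template.label);
// prints "specific"
```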

Citation

Translation templates that are found to be applicable to a target webpage can return a citation for it. These citations include metadata for the target webpage following the format used by Citoid/Zotero.

To generate a citation, template outputs are mapped to the corresponding Citoid/Zotero field. Mapping from Web2Cit to Citoid/Zotero fields is described in the Fields documentation.

Invalid outputs for template fields that have been marked as non-required are simply excluded from the citation.

Notes

  1. Citoid/Zotero fields can be base or derived fields. Some item types have derived fields for some fields, which map back to base fields. In Web2Cit we only use base fields. You can find a list of derived and base regular and creator fields here.
  2. You can give "Try XPath" for Firefox or "CSS and XPath checker" for Chrome a try.
  3. The JMESPath website does not currently show when an error has been found either in the JSON input or in the JMESPath expression. This has been reported to the maintainers and is pending resolution here.
  4. You may use tools to help you test your regular expressions, for example regex101.
  5. See T308354 for discussion on whether transformation steps may be omitted from the fallback template.
  6. As per T305163, selection or transformation steps failing on the application stage (i.e., when they are applied to the given input) return an empty output ([]).

References