Web2Cit/Docs/Storage


The Web2Cit storage is the part of the Web2Cit ecosystem responsible for keeping the collaboratively defined configuration files that dictate the behavior of Web2Cit.

As will be explained in the following sections, these configuration files are JSON files saved as wiki pages on Meta-Wiki.

They are defined collaboratively, with the help of Web2Cit editing tools, on a per-domain basis, with up to three configuration files per domain: templates.json, patterns.json and tests.json.

Some of the concepts in this article are covered in a theoretical video for early adopters of Web2Cit, available on YouTube.

Location

Domain configuration files live in Meta, at Web2Cit/data/, one sub-directory deeper per hostname label, from the top-level domain all the way through the last subdomain.

You can find a full list of configuration files here.

For example, for hostname meta.wikimedia.org, configuration files would be at Web2Cit/data/org/wikimedia/meta/.

The URL scheme (e.g., http or https), port, and path are not part of the hostname (but see T315020). For example, in the URL https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Domain_configuration_files, only meta.wikimedia.org is the hostname.

There are three configuration files per domain: templates.json (for translation templates), patterns.json (for URL path patterns), and tests.json (for translation tests). So, for example, the translation templates configuration file for meta.wikimedia.org would be at Web2Cit/data/org/wikimedia/meta/templates.json.
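The mapping from a URL to its configuration page can be sketched in a few lines of Python. This is only an illustration of the rule described above, not code taken from Web2Cit; the function name `config_page` and the constant `DATA_ROOT` are hypothetical.

```python
from urllib.parse import urlsplit

DATA_ROOT = "Web2Cit/data"  # root page named in this section


def config_page(url: str, filename: str) -> str:
    """Return the page title expected to hold `filename` for the URL's hostname."""
    # urlsplit(...).hostname drops the scheme, port, path, query and fragment.
    hostname = urlsplit(url).hostname
    if not hostname:
        raise ValueError(f"no hostname found in {url!r}")
    # Reverse the labels: top-level domain first, deepest subdomain last.
    labels = reversed(hostname.split("."))
    return "/".join([DATA_ROOT, *labels, filename])


print(config_page(
    "https://meta.wikimedia.org/wiki/Web2Cit/Early_adopters#Domain_configuration_files",
    "templates.json",
))
# prints: Web2Cit/data/org/wikimedia/meta/templates.json
```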

Redirects

Redirects between configuration files are useful for domain aliases. For example, if www.example.com is an alias of example.com, configuration files of the former may be redirected to those of the latter so that the Web2Cit community does not have to maintain separate copies of the same files.

These redirects are followed both by Web2Cit core and by the JSON editor. Read the Domain aliases section of the Editing documentation for more information.

Format

All domain configuration files are written in JSON format (see below for alternative formats).

Generally speaking, our JSON files may have a combination of the following value types:

  • Text strings. For example "xpath".[note 1]
  • Booleans: true or false.
  • Arrays (or lists): zero or more values separated by commas. For example: [ "one", "two", "three" ].
  • Objects: with zero or more "key":value pairs separated by commas. For example:
{
  "key1": value1,
  "key2": value2
}

The MediaWiki editor is not specialized for editing JSON files (unless pages have the JSON content model, see T305571). You may find it useful to make your edits in a separate editor and then paste the result.
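If you edit outside the wiki, a quick local syntax check before pasting can catch missing commas or unbalanced quotes. The following is a minimal sketch using only Python's standard `json` module; the helper name `check_json` is hypothetical, and it checks JSON syntax only, not whether the content is a valid Web2Cit configuration.

```python
import json


def check_json(text: str) -> str:
    """Parse `text` as JSON and return it re-indented; raise ValueError if invalid."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err
    return json.dumps(value, indent=2, ensure_ascii=False)


# json.dumps escapes inner double quotes with a backslash.
print(json.dumps('some "quoted" text'))

# A missing comma between two pairs is reported together with its position.
try:
    check_json('{"type": "citoid" "config": "title"}')
except ValueError as err:
    print(err)
```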

For each configuration file below there is an example file and a JSON-schema file available. The JSON-schema can be used to validate your JSON files, using stand-alone validators or text editor integrations.[1]

We recommend using json-editor,[2] which lets you edit JSON files via a simple form generated from our JSON-schema files (direct links available at each configuration file section below):

  1. If you are editing a pre-existing JSON file, paste it into the json-editor's "JSON Output" field to the right, and click on "Update Form".
  2. Fill in the form.
  3. Copy the JSON output from the field to the right, and paste it into Meta.

templates.json

The templates.json file contains an array of Template objects at its root.

Template objects

Each Template object represents a translation template and has a series of three key:value pairs:

  1. path key, with a string as value, representing the path of the webpage used as translation template, in the current domain (note that multiple Template objects with the same path value will be ignored). Do not include the hostname; just the path beginning with /. You may also include query (?) components. For example, for template webpage https://example.com/news/article?id=3, use /news/article?id=3.
  2. label key, with a string as value, representing the (optional) fancy name for this translation template.
  3. fields key, with an array of TemplateField objects as value. Note that multiple TemplateField objects with the same fieldname value (see below) will be ignored:
{
  "path": string,
  "label": string,
  "fields": TemplateField[]
}

You can use the array of template fields from the default fallback template as a basis for your custom translation templates.[note 2]
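Because Template objects that share a path are ignored, it can help to look for repeated paths before saving. The sketch below is an assumption-laden illustration: `repeated_values` is a hypothetical helper, the sample paths and labels are invented, and it only reports repeats without deciding which copy Web2Cit would ignore.

```python
from collections import Counter


def repeated_values(objects: list[dict], key: str) -> list[str]:
    """Return the values of `key` that occur more than once, in first-seen order."""
    counts = Counter(obj[key] for obj in objects if key in obj)
    return [value for value, n in counts.items() if n > 1]


# Invented sample data with one repeated path.
templates = [
    {"path": "/news/article?id=3", "label": "news", "fields": []},
    {"path": "/about", "label": "about", "fields": []},
    {"path": "/news/article?id=3", "label": "duplicate", "fields": []},
]
print(repeated_values(templates, "path"))
# prints: ['/news/article?id=3']
```

The same helper could be pointed at the fieldname key of a fields array, where the same rule about repeated values applies.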

TemplateField objects

In turn, each TemplateField object represents a template field in the translation template and has a series of three key:value pairs:

  1. fieldname key, with a string as value, representing the name of the template field. See the Fields documentation for currently supported values.
  2. required key, with a boolean (true or false) as value, representing whether the template field should be marked as required or not; see the Templates documentation.
  3. procedures key, with an array of Procedure objects as a value.
{
  "fieldname": string,
  "required": boolean,
  "procedures": Procedure[]
}


Procedure objects

In turn, each Procedure object represents a translation procedure and has a series of two key:value pairs:

  1. selections key, with an array of Selection objects as value.
  2. transformations key, with an array of Transformation objects as value.
{
  "selections": Selection[],
  "transformations": Transformation[]
}

Selection objects

Each Selection object represents a selection step and has a series of two key:value pairs:

  1. type key, with a string as value, representing the specific type of selection step. See the Selection steps subsection of the Templates documentation for currently supported values.
  2. config key, with a string as value,[note 3] representing the specific configuration for the selection step. See the Selection steps subsection of the Templates documentation for currently supported values.
{
  "type": string,
  "config": string
}

Transformation objects

Finally, each Transformation object represents a transformation step and has a series of three key:value pairs:

  1. type key, with a string as value, representing the specific type of transformation step. See the Transformation steps subsection of the Templates documentation for currently supported values.
  2. config key, with a string as value,[note 3] representing the specific configuration for the transformation step. See the Transformation steps subsection of the Templates documentation for currently supported values.
  3. itemwise key, with a boolean (true or false) as value, representing whether the transformation should be applied to each item of the input independently (true), or to the entire input as a whole (false).
{
  "type": string,
  "config": string,
  "itemwise": boolean
}
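The nesting described in this section (Template, TemplateField, Procedure, Selection, Transformation) can be mirrored with Python dataclasses, which may make the hierarchy easier to see. This is a sketch for illustration only: the class definitions are not part of Web2Cit, and the sample values are the ones used in the excerpt later on this page.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Selection:
    type: str
    config: str


@dataclass
class Transformation:
    type: str
    config: str
    itemwise: bool


@dataclass
class Procedure:
    selections: list[Selection] = field(default_factory=list)
    transformations: list[Transformation] = field(default_factory=list)


@dataclass
class TemplateField:
    fieldname: str
    required: bool
    procedures: list[Procedure] = field(default_factory=list)


@dataclass
class Template:
    path: str
    label: str
    fields: list[TemplateField] = field(default_factory=list)


template = Template(
    path="/",
    label="fancy name",
    fields=[
        TemplateField(
            fieldname="title",
            required=True,
            procedures=[
                Procedure(
                    selections=[Selection(type="citoid", config="title")],
                    transformations=[
                        Transformation(type="range", config="0", itemwise=False)
                    ],
                )
            ],
        )
    ],
)

# templates.json holds an array of Template objects at its root.
print(json.dumps([asdict(template)], indent=2))
```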

patterns.json

The patterns.json file contains an array of Pattern objects at its root.

Pattern objects

Each Pattern object represents a URL path pattern and has a series of two key:value pairs:

  1. pattern key, with a string as value, representing a glob path pattern that defines a URL matching group.
  2. label key, with a string as value, representing the (optional) fancy name for this URL path pattern:
{
  "pattern": string,
  "label": string
}
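To get a rough feel for how glob patterns group URL paths, the sketch below uses Python's standard `fnmatch` module. This is an approximation under stated assumptions: this page does not say which glob dialect Web2Cit uses, and `fnmatch` lets `*` match `/` as well, which many glob dialects do not. The patterns, labels and the helper `matching_labels` are invented for the example.

```python
from fnmatch import fnmatchcase

# Invented sample content for a patterns.json file.
patterns = [
    {"pattern": "/news/*", "label": "news articles"},
    {"pattern": "/blog/*", "label": "blog posts"},
]


def matching_labels(path: str) -> list[str]:
    """Return the labels of all patterns that match `path` under fnmatch rules."""
    return [p["label"] for p in patterns if fnmatchcase(path, p["pattern"])]


print(matching_labels("/news/2022-article"))  # ['news articles']
print(matching_labels("/about"))              # []
```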

tests.json


The tests.json file contains an array of Test objects at its root.

Test objects

Each Test object represents a translation test and has a series of two key:value pairs:

  1. path key, with a string as value, representing the path of the webpage used as translation test, in the current domain (note that multiple Test objects with the same path value will be ignored). Just like with the path property of Template objects, do not include the hostname and make sure the path begins with /. You may also include query (?) components.
  2. fields key, with an array of TestField objects as value. Note that multiple TestField objects with the same fieldname value (see below) will be ignored.
{
  "path": string,
  "fields": TestField[]
}

TestField objects

Each TestField object represents a test field in the translation test and has a series of two key:value pairs:

  1. fieldname key, with a string as value: any of the supported translation field names.
  2. goal key, with an array of strings as value, representing the expected translation output, or translation goal, for the given translation field. Each string value must comply with the translation field's validation rule. Provide an empty array to explicitly express that no output is expected.
{
  "fieldname": string,
  "goal": string[]
}
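The two structural rules stated above (the path begins with / and each goal is an array of strings, possibly empty) can be checked with a short script. This is a minimal sketch: `find_test_problems` is a hypothetical helper, the sample title is invented, and the per-field validation rules mentioned above are not checked.

```python
def find_test_problems(test: dict) -> list[str]:
    """Return human-readable problems found in one Test object."""
    problems = []
    path = test.get("path")
    if not isinstance(path, str) or not path.startswith("/"):
        problems.append("path must be a string beginning with /")
    for test_field in test.get("fields", []):
        goal = test_field.get("goal")
        # An empty list is fine: it means no output is expected.
        if not isinstance(goal, list) or not all(isinstance(g, str) for g in goal):
            name = test_field.get("fieldname")
            problems.append(f"{name}: goal must be an array of strings")
    return problems


good = {
    "path": "/news/article?id=3",
    "fields": [
        {"fieldname": "title", "goal": ["Some title"]},
        {"fieldname": "itemType", "goal": []},
    ],
}
bad = {
    "path": "example.com/news/article?id=3",  # hostname included, no leading /
    "fields": [{"fieldname": "title", "goal": "Some title"}],  # string, not array
}

print(find_test_problems(good))
print(find_test_problems(bad))
```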

Alternative formats

Using alternative, more human-readable formats, such as JSON5 or YAML, may help you read and write configuration files manually. We do not currently support any of them, although we may in the future, as tracked in task T302694.

For now, you may use online converters to:[note 4]

  1. Convert a JSON configuration file to either JSON5 or YAML
  2. Edit the configuration file in JSON5 or YAML
  3. Convert back to JSON and validate with JSON-schema (see above)
  4. Save configuration file in JSON

JSON5

JSON5[3] closely resembles JSON but is more flexible, thus tolerating some common JSON mistakes. In our case, the following features may be of interest:

  • keys may be unquoted: { unquoted: "value" }
  • strings may be single-quoted, allowing double quotes inside them: 'single "quoted" string'
  • trailing commas in objects and arrays are OK: { key1: value1, key2: value2, } [ a, b, c, ]

YAML

YAML is indentation-based (like the Python programming language) and is much shorter and (usually)[4] easier to write and read.

This is a comparison between the JSON version (shown first) and the YAML version (shown second) of an example template configuration file excerpt:

[
  {
    "path": "/",
    "label": "fancy name",
    "fields": [
      {
        "fieldname": "title",
        "required": true,
        "procedures": [
          {
            "selections": [
              {
                "type": "citoid",
                "config": "title"
              }, {
                ...
              }
            ],
            "transformations": [
              {
                "type": "range",
                "config": "0",
                "itemwise": false
              },
              {
                ...
              }
            ]
          }
        ]
      },
      {
        "fieldname": "itemType",
        ...
      }
    ]
  },
  {
    ...
  }
]

- path: /
  label: fancy name
  fields:
    - fieldname: title
      required: true
      procedures:
        - selections:
          - type: citoid
            config: title
          - ...
          transformations:
          - type: range
            config: '0'
            itemwise: false
          - ...
    - fieldname: itemType
      ...
- ...

Remember that the procedures key of a TemplateField object takes an array of Procedure objects as value, each with both a selections key and a transformations key. So the following code (shown first in JSON, then in YAML) is wrong, because it specifies two separate Procedure objects, one with only a selections key and another with only a transformations key:

...
        "procedures": [
          {
            "selections": [
              {
                "type": "citoid",
                "config": "title"
              }
            ]
          },
          {
            "transformations": [
              {
                "type": "range",
                "config": "0",
                "itemwise": false
              }
            ]
          }
...

...
      procedures:
        - selections:
          - type: citoid
            config: title
        - transformations:
          - type: range
            config: '0'
            itemwise: false
...
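The mistake above can be detected mechanically by looking for Procedure objects that lack one of the two keys. The following is a sketch only: `incomplete_procedures` is a hypothetical helper, and treating both keys as required is an assumption drawn from the description above rather than from the JSON-schema itself.

```python
REQUIRED_KEYS = ("selections", "transformations")  # assumed required, see lead-in


def incomplete_procedures(template_field: dict) -> list[tuple[int, str]]:
    """Return (procedure index, missing key) pairs for one TemplateField object."""
    missing = []
    for index, procedure in enumerate(template_field.get("procedures", [])):
        for key in REQUIRED_KEYS:
            if key not in procedure:
                missing.append((index, key))
    return missing


# The wrong example from above: one procedure split into two objects.
split_field = {
    "fieldname": "title",
    "required": True,
    "procedures": [
        {"selections": [{"type": "citoid", "config": "title"}]},
        {"transformations": [{"type": "range", "config": "0", "itemwise": False}]},
    ],
}

print(incomplete_procedures(split_field))
# prints: [(0, 'transformations'), (1, 'selections')]
```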

Notes

  1. Text strings start and end with double quotes ". Therefore, avoid unescaped double quotes inside them. For example, in "some "quoted" text" it is not clear where the string starts or ends. If possible, replace the inner double quotes with single quotes ': "some 'quoted' text". Alternatively, escape the inner double quotes with a backslash \: "some \"quoted\" text".
  2. The fallback template definition currently in use is available in the Web2Cit Core source code repository.
  3. See T305903 for a proposal to use an array of config values instead.
  4. For example, toolkit.site's Data Format Converter.

References