Web2Cit/Docs/Basics


Web2Cit is a system that provides citation metadata extraction services from web sources using collaboratively defined per-website configurations. This page provides an overview of how Web2Cit works and the components that it is made of, with links to further detailed pages.

For a general description of how to use Web2Cit, refer to our home page.

How Web2Cit works

Web2Cit configuration

Figure: An overview of how Web2Cit works.

Web2Cit behavior is determined by a series of configuration files collaboratively defined by the community on a per-website basis.

Briefly, Web2Cit extracts citation metadata from web sources (i.e., translates them) using translation templates. Each template is based on a specific webpage (path) and consists of translation procedures, one for each citation field (e.g., title or author names). Each procedure includes a series of selection steps, which specify how to retrieve citation metadata, and transformation steps, which specify how to transform it. Although translation templates are based on specific webpages (paths), Web2Cit may use them to translate other similar pages from the same website (domain).
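The relationship between templates, procedures and steps can be pictured with a minimal Python sketch. It is not Web2Cit's actual configuration schema or API: the class names, the field names, the `page` dictionary standing in for a fetched webpage, and the way selection outputs are combined before being transformed are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-ins: a "page" is a plain dict instead of a fetched webpage.
Page = dict[str, str]
Selection = Callable[[Page], list[str]]
Transformation = Callable[[list[str]], list[str]]


@dataclass
class Procedure:
    """Selection steps retrieve raw values; transformation steps refine them in order."""
    selections: list[Selection]
    transformations: list[Transformation] = field(default_factory=list)

    def run(self, page: Page) -> list[str]:
        values: list[str] = []
        for select in self.selections:
            values.extend(select(page))
        for transform in self.transformations:
            values = transform(values)
        return values


@dataclass
class Template:
    """A template is based on one path and holds one procedure per citation field."""
    path: str
    procedures: dict[str, Procedure]

    def translate(self, page: Page) -> dict[str, list[str]]:
        return {name: proc.run(page) for name, proc in self.procedures.items()}


template = Template(
    path="/news/example-article",  # hypothetical path
    procedures={
        "title": Procedure(
            selections=[lambda page: [page.get("h1", "")]],
            transformations=[lambda values: [v.strip() for v in values]],
        ),
        "authors": Procedure(
            selections=[lambda page: [page.get("byline", "")]],
            transformations=[
                # Drop a leading "By ", then split the byline on commas.
                lambda values: [v[3:] if v.startswith("By ") else v for v in values],
                lambda values: [n.strip() for v in values for n in v.split(",")],
            ],
        ),
    },
)

page = {"h1": "  An example headline ", "byline": "By Ada Lovelace, Alan Turing"}
citation = template.translate(page)
print(citation)
```

Keeping one procedure per field means a field can be fixed or extended without touching the others, which suits configurations edited collaboratively.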

In some cases, multiple templates per website might be needed. These can be grouped into separate translation subgroups (within a domain) based on URL path patterns.

Finally, the Web2Cit community may also define translation tests indicating the expected output for specific webpages. These tests guide collaborators in writing translation templates and are used to regularly check the health of the Web2Cit system.
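As a rough illustration of what a translation test checks, the Python sketch below compares expected field values against the output of a translation. The `tests` structure, the `run_tests` helper and the stand-in `translate` function are hypothetical and do not reflect the real tests.json format or the Web2Cit core API.

```python
def run_tests(tests, translate):
    """Return, per path and field, whether the translation matched the expectation."""
    report = {}
    for path, expected in tests.items():
        actual = translate(path)
        report[path] = {
            field: actual.get(field) == wanted
            for field, wanted in expected.items()
        }
    return report


# Hypothetical expected outputs for two webpages of one domain.
tests = {
    "/news/first-story": {"title": ["First story"], "language": ["en"]},
    "/news/second-story": {"title": ["Second story"]},
}


def translate(path):
    """Stand-in for a real translation: derives a title from the path slug."""
    slug = path.rsplit("/", 1)[-1]
    return {"title": [slug.replace("-", " ").capitalize()], "language": ["es"]}


report = run_tests(tests, translate)
for path, fields in report.items():
    print(path, fields)
```

In this sketch a test only covers the fields it lists, so a failing `language` field does not hide a passing `title` field.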

Web2Cit translation

Given a target webpage:

  1. Web2Cit first checks its domain to decide which set of configuration files it should use (see above).
  2. Then, it checks the target webpage's path, to see what URL path pattern group it belongs to. A catch-all pattern is used if no matching pattern is found.
  3. Finally, it tries all translation templates belonging to the same URL path pattern group, in order, until it finds one that can be applied to the target webpage. It returns a citation for the target webpage using this template. If no applicable template is found, Web2Cit uses a fallback template, which simply returns the response of Wikipedia's automatic citation generator (Citoid) for all fields. See the Templates documentation for more details.
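
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Web2Cit core implementation: patterns are treated here as regular expressions (the real pattern syntax may differ), a template is taken to belong to the group that matches its own path, the in-memory `configs` dictionary replaces the stored configuration files, and the `fallback` function merely stands in for the Citoid-based fallback template.

```python
import re
from urllib.parse import urlsplit

CATCH_ALL = "<catch-all>"  # hypothetical label for the catch-all pattern group


def pattern_group(path, patterns):
    """Return the first pattern matching the path, or the catch-all group."""
    for pattern in patterns:
        if re.search(pattern, path):
            return pattern
    return CATCH_ALL


def fallback(url):
    """Stand-in for the fallback template (the real one returns the Citoid response)."""
    return {"source": "fallback", "url": url}


def translate(url, configs):
    parts = urlsplit(url)
    # Step 1: the domain selects the set of configuration files.
    config = configs.get(parts.hostname, {"patterns": [], "templates": []})
    # Step 2: the path selects a URL path pattern group.
    group = pattern_group(parts.path, config["patterns"])
    # Step 3: try the templates of that group, in order.
    for template in config["templates"]:
        if pattern_group(template["path"], config["patterns"]) != group:
            continue
        result = template["apply"](parts.path)
        if result is not None:  # None means "not applicable" in this sketch
            return {"source": template["path"], **result}
    return fallback(url)


def news_template(path):
    """Hypothetical template: applicable only to paths with a non-empty slug."""
    slug = path.rsplit("/", 1)[-1]
    return {"title": slug} if slug else None


configs = {
    "example.org": {
        "patterns": [r"^/news/"],
        "templates": [
            {"path": "/news/first-story", "apply": news_template},
            {"path": "/about", "apply": lambda path: None},
        ],
    }
}

print(translate("https://example.org/news/second-story", configs))
print(translate("https://example.org/contact", configs))
```

The first call reuses the template based on `/news/first-story` for a sibling page, while the second falls into the catch-all group, finds no applicable template and ends up at the fallback.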

Web2Cit ecosystem

A short video describing the components which make up the Web2Cit ecosystem and how they relate to one another.

Web2Cit is made of a set of separate components that interact with one another.

Web2Cit Core

The Web2Cit core is the software heart of the Web2Cit ecosystem. It is a JavaScript library or module which implements Web2Cit translation as outlined above.

You can read more about the Web2Cit core here.

Web2Cit Storage

The collaboratively defined per-domain configuration files are stored on Meta-Wiki. There are three types of configuration files per domain, patterns.json, templates.json and tests.json, for URL path pattern groups, translation templates and translation tests, respectively, as explained in the Web2Cit configuration section above.

You can read more about Web2Cit storage here.

Web2Cit Editors

The Web2Cit editors are tools to help Web2Cit contributors edit the per-domain configuration files used by Web2Cit.

The JSON editor enables editing individual configuration files using a form.

The integrated editor is a planned tool that would be injected as a sidebar, allowing contributors to edit all configuration files simultaneously and see translation results in real time.

You can read more about Web2Cit editors at our Editing documentation page.

Web2Cit Server

The Web2Cit server is a web service that exposes Web2Cit core functionality to other components of the Web2Cit ecosystem, namely the Web2Cit user script and the Web2Cit monitor, as well as to projects relying on Zotero translators.

More information about the Web2Cit server can be found here.

Web2Cit User script

The Web2Cit user script integrates Web2Cit into Wikipedia by subtly modifying the VisualEditor interface.

Read more about the Web2Cit user script here.

Web2Cit Monitor

The Web2Cit monitor regularly runs the translation tests defined by Web2Cit collaborators, checks that the translation results match the expected outputs, and publishes the test results as wiki pages. Collaborators can add these pages to their watchlists to be notified when test results change.

You can read more about the Web2Cit monitor here.