Gulp

From Meta, a Wikimedia project coordination wiki
This is a proposal based on T231891

Generic Unified List Processor (GULP)

This proposal defines the capabilities of a yet-to-be-built API for handling generic lists of pages/items.

Definitions

list
A set of revisions for a site
revision
A set of page entries
entry
A title, a namespace, and optional metadata
site
A wiki

Essentials

  • Create a new list
  • Add/remove page entries to/from the current revision of a list. This does not automatically create a new revision!
  • Create a snapshot, i.e., freeze the current revision of a list
  • Retrieve a (current or snapshotted) revision of a list
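The four essential operations could be sketched in memory as follows. This is only an illustration, not a proposed implementation: the names `PageList`, `Revision`, `add`, `remove`, `snapshot` and `get` are hypothetical, and the sketch assumes that taking a snapshot freezes the current revision and opens an editable copy as the new current revision.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical representation: an entry is (page title, namespace number).
Entry = tuple[str, int]


@dataclass
class Revision:
    entries: set[Entry] = field(default_factory=set)
    frozen: bool = False  # True once the revision has been snapshotted


class PageList:
    """A list for one site, holding zero or more snapshots plus one current revision."""

    def __init__(self, name: str, site: str) -> None:
        self.name = name
        self.site = site
        self.revisions: list[Revision] = [Revision()]

    @property
    def current(self) -> Revision:
        return self.revisions[-1]

    def add(self, title: str, namespace: int = 0) -> None:
        # Editing the current revision does not create a new revision.
        self.current.entries.add((title, namespace))

    def remove(self, title: str, namespace: int = 0) -> None:
        self.current.entries.discard((title, namespace))

    def snapshot(self) -> int:
        """Freeze the current revision; return the index of the frozen one."""
        frozen = self.current
        frozen.frozen = True
        # Assumption: editing continues on a copy of the frozen revision.
        self.revisions.append(Revision(entries=set(frozen.entries)))
        return len(self.revisions) - 2

    def get(self, index: Optional[int] = None) -> frozenset[Entry]:
        """Return the current revision, or the snapshot at the given index."""
        rev = self.current if index is None else self.revisions[index]
        return frozenset(rev.entries)
```

For example, adding "Berlin" and "Paris", calling `snapshot()`, and then removing "Paris" leaves the snapshot with both pages and the current revision with "Berlin" only.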

Minimum viable product

  • Delete a revision or list (?)
  • Import from various sources
    • All sources offered in PagePile
    • Wiki pages
  • Export to various places
    • All consumers offered in PagePile
    • Listeria (V2?)
    • Wiki pages
  • Combine lists (subset, union, diff, etc.)
  • Filter lists
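Combining and filtering map naturally onto set operations if entries are treated as (title, namespace) pairs. The sketch below assumes that representation; the function names are hypothetical, and "subset" is interpreted here as the intersection of two lists, which is one possible reading of the item above.

```python
# Hypothetical representation: an entry is (page title, namespace number).
Entry = tuple[str, int]


def union(a: set[Entry], b: set[Entry]) -> set[Entry]:
    """Pages that are in either list."""
    return a | b


def intersection(a: set[Entry], b: set[Entry]) -> set[Entry]:
    """Pages that are in both lists."""
    return a & b


def difference(a: set[Entry], b: set[Entry]) -> set[Entry]:
    """Pages that are in the first list but not in the second."""
    return a - b


def filter_namespace(entries: set[Entry], namespace: int) -> set[Entry]:
    """Keep only the entries in the given namespace."""
    return {entry for entry in entries if entry[1] == namespace}
```

A real implementation working on large lists would more likely push these operations down into the storage layer than load both lists into memory.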

Nice-to-have

  • List and/or revision with optional expiration date (automatically updated lists)
    • Possibly "temporary" lists with ~1h expiration date?
  • Maximum number of revisions (delete the oldest one when capacity is reached)
  • Visual interface for pipelines (generate/filter/combine/output) based on lists
  • Diff functionality between revisions
    • For the interface (what has changed?)
    • For storage (less space required)
  • Metadata per entry (store/update/remove/query)
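Two of the items above, revision diffs and a cap on the number of revisions, are simple enough to sketch. The function names below are hypothetical, and entries are assumed to be (title, namespace) pairs held in sets; storing only the diff against the previous revision is one way the "less space required" goal could be approached.

```python
# Hypothetical representation: an entry is (page title, namespace number).
Entry = tuple[str, int]


def diff_revisions(old: set[Entry], new: set[Entry]) -> dict[str, set[Entry]]:
    """Describe what changed between two revisions."""
    return {"added": new - old, "removed": old - new}


def apply_diff(old: set[Entry], diff: dict[str, set[Entry]]) -> set[Entry]:
    """Rebuild the newer revision from the older one plus a stored diff."""
    return (old - diff["removed"]) | diff["added"]


def prune(revisions: list, max_revisions: int) -> list:
    """Keep only the newest max_revisions items (oldest first in the list)."""
    if max_revisions < 1:
        raise ValueError("max_revisions must be at least 1")
    return revisions[-max_revisions:]
```

Note that if revisions are stored as diffs, pruning the oldest one requires materialising its successor first, since that successor can no longer be rebuilt from a deleted base.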

Extended version

What if it's not just "page lists", but any (general, or one of several pre-defined types of) tables?

  • One table type would be "page title/page namespace", giving us the above lists.
  • Others could be, say, Mix'n'match catalogs ("external ID/url/name/description/instance of")
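As a rough illustration of pre-defined table types, each type could be a named tuple of column names against which rows are checked. The type keys, column identifiers and the `validate_row` helper below are all hypothetical; only the column meanings come from the two examples above.

```python
# Hypothetical registry of pre-defined table types and their columns.
TABLE_TYPES: dict[str, tuple[str, ...]] = {
    "page_list": ("title", "namespace"),
    "mixnmatch_catalog": (
        "external_id", "url", "name", "description", "instance_of",
    ),
}


def validate_row(table_type: str, row: dict) -> bool:
    """Return True if the row has exactly the columns of the given table type."""
    columns = TABLE_TYPES[table_type]  # raises KeyError for unknown types
    return set(row) == set(columns)
```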

Technical notes

Storage options for revisions

  • SQLite files on disk, à la PagePile (or use PagePile transparently in the background). Works for large lists, especially for combination/filtering of lists.
  • Commons Data: namespace (aka ".tab files"). Size limited.
  • (MySQL) database. Might eventually outgrow capacity.
  • Disk-backed object store

Any combination of the above could be used transparently, depending on the list (large lists => SQLite, small lists => MySQL, etc.). The storage backend could even change from one revision of a list to the next.
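A size-based choice of backend could be as small as the sketch below. The `Backend` enum, the `choose_backend` function and especially the threshold value are assumptions made for illustration; the proposal does not say where the cut-off between "small" and "large" lies.

```python
from enum import Enum


class Backend(Enum):
    MYSQL = "mysql"
    SQLITE = "sqlite"


# Hypothetical cut-off; the proposal does not specify a number.
SQLITE_THRESHOLD = 10_000


def choose_backend(entry_count: int) -> Backend:
    """Pick a storage backend for one revision based on its size."""
    if entry_count >= SQLITE_THRESHOLD:
        return Backend.SQLITE
    return Backend.MYSQL
```

Because the decision is made per revision, recording the chosen backend alongside each revision would let storage switch between revisions as described above.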

Identifiers

  • One ID (number) for lists, another for revisions (like MediaWiki)
  • One ID (number) for both. Using the list ID when asking for a revision would automatically use the latest revision. Simple but might confuse users.
  • Combined ID (string), e.g. "123.456" or "123/456" (the latter could be useful in URLs; a missing revision part would automatically resolve to the latest revision)
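The combined-ID option could be parsed as sketched below. `ListRef` and `parse_ref` are hypothetical names; a revision of `None` stands for "use the latest revision", matching the fallback described above.

```python
from typing import NamedTuple, Optional


class ListRef(NamedTuple):
    list_id: int
    revision_id: Optional[int]  # None means "latest revision"


def parse_ref(ref: str) -> ListRef:
    """Parse "123", "123.456" or "123/456" into list and revision IDs."""
    for separator in ("/", "."):
        if separator in ref:
            list_part, revision_part = ref.split(separator, 1)
            revision = int(revision_part) if revision_part else None
            return ListRef(int(list_part), revision)
    return ListRef(int(ref), None)
```

Non-numeric input raises `ValueError` from `int()`, which a real API would presumably translate into an error response.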

Data structures

  • List
    • ID
    • Name
    • Description?
    • Site
    • User who created it (special privileges?)
    • Creation timestamp
    • Last update timestamp
    • Last revision
    • Optional: Maximum number of revisions (more => oldest gets deleted)
    • Optional: Maximum age of revisions (old ones get deleted)
    • Optional: Maximum age of list (auto-destruct for temporary lists)
    • Optional: Source
  • Revision
    • ID
    • Date of creation
    • Entries
    • Is current or snapshot?
    • Previous revision ID
  • Entry
    • Page name
    • Page namespace
    • Metadata (possibly JSON object for flexibility)
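The three structures above could be written down as Python dataclasses, as in the following sketch. The field names and types are assumptions: the namespace is taken to be a numeric MediaWiki namespace ID, the age limits are expressed in days, and the class is called `PageList` only to avoid shadowing Python's built-in `list`.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Entry:
    page_name: str
    page_namespace: int  # assumed to be a numeric namespace ID
    metadata: dict[str, Any] = field(default_factory=dict)  # free-form, JSON-like


@dataclass
class Revision:
    id: int
    created: datetime
    entries: list[Entry] = field(default_factory=list)
    is_snapshot: bool = False  # False means this is the current revision
    previous_revision_id: Optional[int] = None


@dataclass
class PageList:
    id: int
    name: str
    site: str
    creator: str
    created: datetime
    updated: datetime
    last_revision_id: Optional[int] = None
    description: Optional[str] = None
    max_revisions: Optional[int] = None          # oldest deleted beyond this
    max_revision_age_days: Optional[int] = None  # assumed unit: days
    max_list_age_days: Optional[int] = None      # for temporary lists
    source: Optional[str] = None
```

`dataclasses.asdict` turns any of these into nested dictionaries, which is convenient for serialising to JSON once the timestamps are converted to strings.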