New Readers/Offline/Content curation

From Meta, a Wikimedia project coordination wiki

There are two main use cases where a reader might need access to a set of articles offline:

  • The Wikipedia Android app
  • Any fully offline deployment, such as through Kiwix or other offline open educational resources.

In both of these cases, users want the right content for the context that they're in. For those on the Android app or other mobile apps with limited storage, it's critical to have small content packs that are tightly focused on a single topic, a finding from design research earlier this year. For fully offline contexts, like a classroom or a medical clinic, there may be more space, but there is still a need for targeted content.

A UC Berkeley student group, Diversatech, has offered to help the New Readers team prototype a tool that helps users select articles to include in a pack. This is exploratory work that may or may not lead to a production tool, and we expect that the scope will not include any file generation, hosting, or browsing at this time, just the act of curating a list. We will gather requirements here.

Prototype information

The prototype currently creates a list of article links, which you can copy to your clipboard. You can see the prototype here: https://tools.wmflabs.org/zimmerbot/
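As an illustration of that output step, here is a minimal sketch of turning a list of article titles into newline-separated links ready to copy. It is not taken from the prototype's code: the function names `article_url` and `links_for_clipboard` are hypothetical, and the sketch assumes the standard `https://<lang>.wikipedia.org/wiki/<Title>` URL form with spaces replaced by underscores.

```python
from urllib.parse import quote


def article_url(title: str, lang: str = "en") -> str:
    """Build an article URL from a page title (hypothetical helper)."""
    # MediaWiki page URLs use underscores in place of spaces.
    # Keep a few common title characters readable instead of percent-encoding them.
    path = quote(title.replace(" ", "_"), safe="/:()")
    return f"https://{lang}.wikipedia.org/wiki/{path}"


def links_for_clipboard(titles, lang: str = "en") -> str:
    """Return one link per line, dropping duplicate titles but keeping order."""
    seen = set()
    lines = []
    for title in titles:
        if title not in seen:
            seen.add(title)
            lines.append(article_url(title, lang))
    return "\n".join(lines)


if __name__ == "__main__":
    print(links_for_clipboard(["Cat", "Maine Coon", "Cat"]))
```

Removing duplicates matters once the same article can arrive through more than one selection path, for example through both a category and a WikiProject.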

Documentation/specs to come.

Requirements

If possible, the tool should work on both mobile and desktop to support readers who primarily access the internet via smartphone.

Users should be able to select articles either individually or through groups of content:

  • WikiProjects
  • Categories
  • An article + linked articles
  • An article + related articles

In addition, they should be able to filter by various criteria. Some possibilities:

  • Article score (a weighted aggregate of quality and importance, only available on en:WP)
  • Popularity (page views)
  • Quality (either through WikiProject assessment or automatic ORES assessment)
  • Most linked to
  • Exclude stubs, i.e. very short articles (either through WikiProject assessment or automatic ORES assessment)
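To make the filtering idea concrete, here is a small sketch that applies a quality floor and a page-view floor to candidate articles, then orders them by popularity. Everything in it is an assumption for illustration: the `Article` record, the `filter_articles` function, and the quality scale (the English Wikipedia assessment classes from Stub to FA). Real values would come from WikiProject assessments, ORES, or page-view data.

```python
from dataclasses import dataclass

# Assumed quality scale, lowest to highest (English Wikipedia assessment classes).
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]


@dataclass(frozen=True)
class Article:
    """Hypothetical record holding the metadata used for filtering."""
    title: str
    pageviews: int       # e.g. views over some recent period
    quality: str         # one of QUALITY_ORDER
    incoming_links: int  # how many other articles link here


def filter_articles(articles, min_quality="Start", min_pageviews=0):
    """Drop articles below the quality or page-view floor, most-read first."""
    floor = QUALITY_ORDER.index(min_quality)
    kept = [
        a for a in articles
        if QUALITY_ORDER.index(a.quality) >= floor and a.pageviews >= min_pageviews
    ]
    return sorted(kept, key=lambda a: a.pageviews, reverse=True)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Article("Cat", 9000, "GA", 500),
        Article("Obscure cat breed", 40, "Stub", 3),
        Article("Kitten", 3000, "C", 120),
    ]
    for article in filter_articles(sample):
        print(article.title)
```

Setting `min_quality="Start"` is one way to express "exclude stubs"; sorting by `incoming_links` instead of `pageviews` would give a "most linked to" ordering.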

Users should be able to view the estimated file size and, ideally, specify a maximum output file size that limits the number of articles in the list. We could imagine allowing users to specify something like "100MB of the most-read articles about cats," for example.
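One simple way to honor a size limit is a greedy pass: walk the candidates from most-read to least-read and keep each one that still fits in the remaining budget. The sketch below assumes each candidate already carries an estimated size in bytes; how that estimate is produced is outside the sketch, and the name `select_within_budget` is hypothetical.

```python
def select_within_budget(candidates, budget_bytes):
    """Greedily pick the most-read articles that fit within budget_bytes.

    candidates: iterable of (title, pageviews, estimated_bytes) tuples.
    Returns (chosen_titles, bytes_used).
    """
    chosen = []
    used = 0
    # Most-read first; an article that does not fit is skipped, not a stop signal,
    # so smaller articles further down the list can still fill the remaining space.
    for title, _views, size in sorted(candidates, key=lambda c: c[1], reverse=True):
        if used + size <= budget_bytes:
            chosen.append(title)
            used += size
    return chosen, used


if __name__ == "__main__":
    # Made-up sizes and view counts, for illustration only.
    cats = [("Cat", 100, 60), ("Kitten", 90, 50), ("Purr", 80, 40)]
    print(select_within_budget(cats, budget_bytes=100))
```

This greedy pass is not guaranteed to make the best possible use of the budget (that is a knapsack problem), but it is easy to explain to users and keeps the most popular articles.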

Users should be able to specify whether the file should have text only, text and images, or text and all media.

Outputs

The tool should deliver:

Technical guidance

From Ryan Kaldari

First, I would strongly recommend setting up both a GitHub repo for the project and a project account on the Wikimedia Foundation's Toolforge server. Hosting the project on Toolforge has the big advantage of having direct server-side access to replicas of the Wikipedia databases. The portal for Toolforge documentation can be found at https://wikitech.wikimedia.org/wiki/Portal:Toolforge. Reading through https://wikitech.wikimedia.org/wiki/Help:Getting_Started will be especially helpful. I also recommend joining the #wikimedia-cloud IRC channel. Basically, you'll need to create some LDAP accounts, get ssh access to Toolforge, and then create a new Toolforge project.

As far as building the actual tool, if you're willing to go with Python, there is a pretty mature and well-maintained API interface library at https://github.com/wikimedia/pywikibot-core. It has good documentation (at https://www.mediawiki.org/wiki/Manual:Pywikibot and https://doc.wikimedia.org/pywikibot/) and there are tons of existing scripts out there to look at that use the library.

It's also possible to just write your own interface to the MediaWiki API directly if you prefer doing that, but I don't really recommend it. The documentation on the MediaWiki API can be found at https://www.mediawiki.org/wiki/API:Main_page.
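For a sense of what a hand-written interface involves, here is a minimal sketch that lists the articles in a category through the MediaWiki Action API (`action=query` with `list=categorymembers`) and follows the API's `continue` block to page through results. The helper names `fetch_json` and `category_members`, the example User-Agent string, and the choice of the English Wikipedia endpoint are assumptions for illustration. The fetch function is passed in as a parameter so the paging logic can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: English Wikipedia's Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Perform one GET request against the API and decode the JSON response."""
    url = f"{API_URL}?{urlencode(params)}"
    # Placeholder User-Agent; a real tool should identify itself and its maintainer.
    request = Request(url, headers={"User-Agent": "content-curation-sketch/0.1 (example)"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json, limit=500):
    """Return the titles of all main-namespace pages in a category."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:Cats"
        "cmnamespace": 0,      # articles only, no subcategories or files
        "cmlimit": limit,
    }
    titles = []
    while True:
        data = fetch(params)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Merge the continuation values into the next request, as the API expects.
        params = {**params, **data["continue"]}


if __name__ == "__main__":
    for title in category_members("Category:Cats"):
        print(title)
```

Even this small example has to deal with paging, request headers, and response shapes, which is part of why a maintained library is the easier route.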

Besides the MediaWiki core API, there are a few other APIs that you may be interested in pulling data from:

Other guidance

User:Risker on guidelines for new content types: https://en.wikipedia.org/wiki/User:Risker/Risker%27s_checklist_for_content-creation_extensions