Grants:IEG/WikiBrainTools/UseCases

This page asks Wikipedia researchers and tool developers to tell us which Wikipedia-based algorithms would support their work. This feedback will drive the design of WikiBrain and the WikiBrainTools individual engagement grant.

Are you a researcher or tool developer? Help us!

We want to make sure that the new WikiBrain API we create supports tool developers and researchers (that's you!). Add a sentence or two below telling us about your algorithmic needs! The sections that follow contain more information about the project.

  • Describe the Wikipedia-based algorithm you need, and the application / bot / research it supports. Also let us know if you're interested in being a WikiBrainTools API pilot user.

SuggestBot

As I mentioned SuggestBot in my endorsement of this project, I thought I could mention that it could benefit from several of the features available in WikiBrain. The easiest to point out are the semantic relatedness algorithms, which could potentially increase the relevance of the articles SuggestBot suggests. SuggestBot could also benefit from the GIScience (spatial) algorithms to discover or filter articles within a specific geographic area. Lastly, WikiBrain's access to page view data could also help. SuggestBot has been reporting article popularity for several years, but currently relies on the availability of that data from stats.grok.se.

As far as I understand, WikiBrain does not currently implement any collaborative/social filtering algorithms, which SuggestBot has, so if those were implemented the bot could also benefit from those. Regards, Nettrom (talk) 16:38, 8 October 2014 (UTC)

What is WikiBrain?

WikiBrain is a Java software library we created to democratize access to state-of-the-art Wikipedia-based algorithms and technologies. WikiBrain downloads, parses, stores, and analyzes Wikipedia data in any language, providing access to state-of-the-art NLP, AI, and GIScience algorithms with the click of a button on commodity hardware. The project has a robust existing codebase, broad support from researchers, and has been well received by the Wikipedia research community. WikiBrain is described in more depth in our 2014 WikiSym / OpenSym publication [1] and the WikiBrain website.

What would this Wikimedia Engagement Grant support?

This project would support a (probably web) API designed specifically for Wikipedia researchers and tool developers.

WikiBrain currently targets NLP / AI / GIScience researchers, allowing them to integrate and extend state-of-the-art Wikipedia-based algorithms. Although we strive to make software installation as easy as possible, some barriers to integration still exist for Wikipedia researchers and tool developers. For example, importing several large language editions of Wikipedia can take a day or more and require 500 GB of disk space. In addition, WikiBrain is written in Java, making it difficult to call from software written in Python or PHP (common choices for researchers and tool developers).

We would create a REST web API to WikiBrain exposing features that are valuable to researchers and tool developers. We would also develop "client libraries" for the API in Python (and perhaps PHP) that would enable researchers and bots to immediately access a wide range of efficient Wikipedia-based algorithms. For more details, please see the full WikiBrainTools individual engagement grant.
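
To make the idea of a thin client library concrete, here is a minimal sketch in Python of how a client might assemble a request to such a REST API. Everything specific in it is an assumption for illustration only: the API has not been designed yet, so the base URL, the `similarity` endpoint name, and the parameter names `lang`, `phrase1`, and `phrase2` are hypothetical placeholders, not part of WikiBrain.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: no WikiBrain web API exists at this address.
BASE_URL = "https://wikibrain.example.org/api"


def build_request_url(endpoint, **params):
    """Build the URL for a call to a (hypothetical) REST endpoint.

    Parameters are sorted by name so the same call always yields the same URL,
    which makes responses easier to cache on the client side.
    """
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{endpoint}?{query}"


# Example: ask for the semantic similarity of two English phrases.
url = build_request_url("similarity", lang="en", phrase1="jazz", phrase2="blues")
print(url)
```

A real client would send this URL with an HTTP library and decode the response; that part is omitted because the response format is not yet defined.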

A partial description of WikiBrain features to get your brain going

To help give you some context, here's a (partial) list of features WikiBrain currently supports that could be easily made available through the API. We are also open to adding new features. We don't list features already provided by the Wikipedia API below.

  • Algorithms on basic Wikipedia data structures:
    • PageRank values for articles and categories to estimate their "importance."
    • Distance between two pages (or categories) in the category or link graphs.
    • Disambiguation of phrases to articles, supporting many more phrases than just disambiguation pages (e.g. Obama -> Barack_Obama ).
    • Mapping of an article to the "best" top-level category.
    • Very fast basic querying of the category and link graphs.
  • Semantic algorithms:
    • Similarity score between two phrases or articles.
    • Cosimilarity matrix between a collection of phrases or articles.
    • Given a phrase or article, return the most similar articles to it.
  • Spatial algorithms:
    • Rich polygonal representations of the state / country / continent associated with a geo-tagged article.
    • List of geo-tagged articles.
    • All geo-tagged articles contained within a particular state / country / continent.
    • All geo-tagged articles contained within some user-specified polygon.
    • Geodetic distance between two articles, considering polygonal geographic features.
  • Multi-lingual:
    • Access to the full multi-lingual article graph.
    • Given two articles, what languages link between them?
  • Wikification:
    • Given a piece of text, extract likely "links" to Wikipedia concepts.
  • Page view data:
    • Number of page views per article for a time period, including redirects to it.
  • Wikidata:
    • Facts about articles.
    • Articles that match a query (e.g. musicians from Japan).
    • May need to be reconciled with the official Wikidata web API.
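
To give a feel for two of the graph features listed above (distance between pages in the link graph, and PageRank as an "importance" estimate), here is a small self-contained Python sketch on a toy link graph. It is not WikiBrain's implementation: the page titles and links are invented, the distance is a plain breadth-first search over directed links, and the PageRank is a textbook power iteration with an assumed damping factor of 0.85.

```python
from collections import deque

# Toy link graph (page -> pages it links to). Titles and links are made up.
LINKS = {
    "Jazz": ["Blues", "Music"],
    "Blues": ["Music"],
    "Music": ["Jazz"],
    "Tokyo": ["Music"],
}


def link_distance(graph, source, target):
    """Length of the shortest directed link path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def pagerank(graph, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank; returns a dict page -> score."""
    pages = sorted(set(graph) | {p for targets in graph.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


ranks = pagerank(LINKS)
print(link_distance(LINKS, "Tokyo", "Blues"))
print(max(ranks, key=ranks.get))
```

On this toy graph the shortest path from "Tokyo" to "Blues" runs through "Music" and "Jazz", and "Music" ends up with the highest score because every other page links to it. An API version of these features would return the same kind of values without requiring a local import of Wikipedia.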

References

  1. WikiSym 2014: Sen, S., Li, T., Hecht, B. 2014. WikiBrain: Democratizing computation on Wikipedia. Proceedings of the 10th International Symposium on Open Collaboration (OpenSym / WikiSym 2014). New York: ACM Press.