Grants:IEG/MediaWiki data browser

From Meta, a Wikimedia project coordination wiki

status: completed

Individual Engagement Grants

project: MediaWiki data browser

project contact: yaron(_AT_)wikiworks.com

participants: Yaron Koren

summary: A JavaScript framework that can be used to navigate through the structured data of Wikipedia and other MediaWiki wikis.

engagement target: MediaWiki, all projects

strategic priority: Improving Quality

total amount requested: 30,000 USD

2013 round 1

Project idea

There is, at the moment, no way to browse through the data of Wikipedia. Wikipedia contains vast amounts of structured data - a pre-defined set of fields for nearly every type of page, that's viewable within infoboxes - but there is no way to navigate through that data by selecting filters in the manner that we have become accustomed to, on sites like Amazon, eBay, Yelp, Craigslist and many others.

A drill-down interface like this one can be useful for a variety of reasons:

  • to help see the overall structure and layout of the data
  • to help find patterns in the data in a relatively painless way
  • to find individual pages that match some criteria
  • to find gaps in the data that need to be filled in.

This is of course useful for Wikipedia, with its vast store of general knowledge, and it's also useful for any other MediaWiki wiki that contains structured data.

Let's take a single example of data from Wikipedia: world leaders. A drill-down interface could help answer all of the following questions, which would currently be difficult to find an answer to in any source, Wikipedia or otherwise:

  • How many national leaders have each of the continents - Africa, Asia and so on - had in total?
  • What was the average time in office for world leaders in the 18th century?
  • What is the party breakdown for Prime Ministers of Canada?
  • How many current or former world leaders are still alive?

It is very important to clarify that this project does not seek to implement a single, standalone web site or app; instead, it will provide what one could call a "framework", that can be used to easily generate such apps and web sites. The plan is that all that will be required of someone creating an app with this framework is to specify a wiki URL, one or more category names, and a set of relevant field names for each category, plus perhaps some general settings information, and the code can do the rest - no programming will be required.
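As a rough illustration of the kind of app definition described above, here is a minimal sketch. All of the property names (`wikiUrl`, `categories`, `fields`, `settings`), the `validateConfig` helper, and the example category and field names are assumptions made for illustration; the proposal does not specify a configuration format.

```javascript
// Hypothetical app definition: every property name here is an assumption
// made for illustration, not part of any released framework.
const appConfig = {
  wikiUrl: "https://en.wikipedia.org/w/",
  categories: [
    {
      // Illustrative category and field names only.
      name: "World leaders",
      fields: ["Country", "Party", "Took office", "Left office"]
    }
  ],
  settings: { title: "World leaders browser" }
};

// Returns a list of problems; an empty list means the definition is usable.
function validateConfig(config) {
  const problems = [];
  if (typeof config.wikiUrl !== "string" || config.wikiUrl === "") {
    problems.push("wikiUrl is required");
  }
  if (!Array.isArray(config.categories) || config.categories.length === 0) {
    problems.push("at least one category is required");
  } else {
    config.categories.forEach((category, index) => {
      if (!category.name) {
        problems.push("category " + index + " has no name");
      }
      if (!Array.isArray(category.fields) || category.fields.length === 0) {
        problems.push("category " + index + " has no fields");
      }
    });
  }
  return problems;
}
```

Someone setting up an app would only edit the data at the top; a check like `validateConfig` is one way the framework could report mistakes without requiring any programming.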

Why make a framework, instead of a single standalone app? That's for three main reasons:

  1. The set of Wikipedia's data is much too big to be navigable in a single application - even if the storage system allowed it, which it wouldn't. This way, sites and apps can be created that allow for navigating through one or a handful of sets of data (like world leaders) in a controlled, curated way.
  2. This generic approach will also allow for creating apps to navigate through any other MediaWiki-based wiki's data, provided that the wiki stores information on its pages in a structured way using templates. In such a case, assuming the wiki has a number of pages in the hundreds or thousands, then it would in fact be possible to display all the data from a wiki in one place, in a sane way.
  3. It will allow for a more seamless transition once Wikipedia converts to using Wikidata's own data, by separating the logic from the underlying data.

It's worth it, at this point, to mention Wikidata. Wikidata is an extremely important project that is meant to hold a single store of data, usable across the infoboxes (and interlanguage links, and possibly other areas) of all language Wikipedias.

The rise of Wikidata will not do anything to lessen the usefulness of this project. Wikidata is intended primarily as a data store for Wikipedia articles. It will eventually allow for querying of its data, but only to a limited, controlled extent, and without any user interface element. So even once Wikidata becomes fully operational, it will still lack, as Wikipedia lacks today, a high-level navigation of its data. And Wikidata may eventually be able to answer some or all of the questions listed above, through the use of queries; but for the average user, formulating such queries may be difficult to impossible, and for any user, a drill-down interface offers a much more immediate view of the data.

How will this planned data browser avoid overloading the database, and/or crashing Wikipedia? For that, let's look at the planned technical design of the system.

It is planned that the data browser will be created in HTML5 and JavaScript. It will store a subset of Wikipedia's infobox data locally in the browser, using either of the two current browser data-storage systems, Web SQL and IndexedDB. The web app, when initially loaded, will populate the database, after which all querying of the data will happen in the browser, without any load on the server. (That will in turn mean that, in conjunction with an offline storage system for the HTML and JavaScript, like localStorage, the site/app could also be used offline after being accessed.) The data that the web app retrieves will come from one or more static files, most likely in CSV but possibly in another standard data format like JSON or XML. Those files, in turn, will get updated on a regular basis (say, once a day or once a week) from the underlying data, and this will be the only time live data is accessed, so impact on the server side should be minimal.
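The client-side flow described above (load a static file once, then answer every query locally) can be sketched roughly as follows. This is only an in-memory illustration: the function names `parseSimpleCsv` and `drillDown` are hypothetical, the CSV parser assumes no quoted fields or embedded commas, the sample rows are invented placeholders, and a real app would write the parsed records into IndexedDB or Web SQL rather than keep them in an array.

```javascript
// Parse a simple CSV string (no quoted fields or embedded commas) into
// an array of objects keyed by the header row.
function parseSimpleCsv(text) {
  const lines = text.trim().split(/\r?\n/);
  const headers = lines[0].split(",").map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const record = {};
    headers.forEach((header, i) => {
      record[header] = (cells[i] || "").trim();
    });
    return record;
  });
}

// Keep only the records whose fields match every selected filter value.
function drillDown(records, filters) {
  return records.filter((record) =>
    Object.keys(filters).every((field) => record[field] === filters[field])
  );
}

// Placeholder rows, invented for illustration only.
const sampleCsv = [
  "Page,Continent,Party",
  "Leader A,Europe,Party X",
  "Leader B,Asia,Party Y",
  "Leader C,Europe,Party Y"
].join("\n");

const sampleRecords = parseSimpleCsv(sampleCsv);
const europeanLeaders = drillDown(sampleRecords, { Continent: "Europe" });
console.log(europeanLeaders.map((r) => r.Page)); // Leader A and Leader C
```

Because filtering runs entirely over the local copy, selecting a further filter (for example, adding `Party: "Party Y"`) never touches the wiki's servers.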

Once the basic framework is created, there are a number of interesting visualizations that could be added to the system. These include:

  • a map showing all pages in some set, based on their coordinate data (most likely done using OpenLayers)
  • display of pages in calendars or timelines, for those pages that have date-based data
  • display of value breakdowns in a bar chart, showing the relative popularity of different values (for example, males vs. females among world leaders)

These options have direct analogues among the visualization options offered by the Semantic MediaWiki extension. The difference here is that no tagging needs to be done within the wiki itself; and the display of data is all done outside the wiki as well.
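The value breakdown behind the bar chart option could be computed along these lines. This is a sketch under assumptions: `valueBreakdown` and `textBars` are hypothetical names, the sample rows are placeholders, and the text bars stand in for whatever charting code the framework would actually use.

```javascript
// Count how often each value of one field occurs, most common first.
// Records that lack the field are counted under "(none)".
function valueBreakdown(records, field) {
  const counts = new Map();
  for (const record of records) {
    const value = record[field] || "(none)";
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return Array.from(counts, ([value, count]) => ({ value, count })).sort(
    (a, b) =>
      b.count - a.count ||
      (a.value < b.value ? -1 : a.value > b.value ? 1 : 0)
  );
}

// Render the breakdown as text bars; a real chart would draw these instead.
function textBars(breakdown) {
  return breakdown.map(
    (row) => row.value + " " + "#".repeat(row.count) + " " + row.count
  );
}

// Placeholder rows, invented for illustration only.
const leaderRows = [
  { Page: "Leader A", Gender: "female" },
  { Page: "Leader B", Gender: "male" },
  { Page: "Leader C", Gender: "male" },
  { Page: "Leader D" }
];
const genderBreakdown = valueBreakdown(leaderRows, "Gender");
console.log(textBars(genderBreakdown).join("\n"));
```

Counting missing values explicitly also serves one of the goals listed earlier: it makes gaps in the data visible.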

Finally, the source code for this framework will be made publicly available, and will be released as open source under the GNU General Public License.

Project goals

The goal of this project is to create a set of (open source) JavaScript code that can be used to automatically generate any number of data-specific "apps" (both web-based and device-based), that can be used to browse parts of Wikipedia/Wikidata, or any other MediaWiki-based wiki.

For "apps" that use Wikipedia/Wikidata's own data, some could potentially even be published by the Wikimedia Foundation itself, on the wikipedia.org/wikidata.org domains or elsewhere.

The ultimate goal is to provide a high-level, aggregated view of the data within both Wikipedia and other MediaWiki wikis, in a way that many other websites, and wikis that make use of Semantic MediaWiki, already do.


Part 2: The Project Plan

Project plan

Scope:

Scope and activities

The scope of this project will include all the tasks associated with software development: software design, interface design, graphic design, code creation, testing (of both software and the user experience), and release, with most likely a number of releases during the six-month period.

A development schedule for this project might look like the following:

  • After 1 month: First version of the software is released. It provides a generic interface for browsing through the structured data of a single category of a MediaWiki wiki, using at least one of the two browser database solutions (IndexedDB and Web SQL).
  • After 2 months: Capability is added to do the same for some subset of Wikidata's data, as well as for the other browser database solution, if only one existed before.
  • After 3 months: Handling is added for browsing through multiple categories, where the categories are linked to each other in some way (for example, a category of cities and a category of countries). More complex displays are also added, such as maps for coordinate data and timelines for date-based data.
  • After 4 months: The ability to turn this code into a full-fledged mobile app is created, using one or more tools such as PhoneGap. The display may be customized for the smaller screens of mobile devices as well.
  • After 5 months: Display is made smarter, able to tailor itself to the nature of the data - for instance, if there are more than 50 possible values for a given field, maybe only the most popular values will be shown; or maybe the values will be further subdivided by the first letter of their name, etc. This will most likely involve feedback from one or more user interface experts.
  • After 6 months: Administrative interface is improved, possibly with a web-based tool, to make setup of an app as easy and foolproof as possible.
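The "smarter display" idea in the five-month milestone could work roughly as in the sketch below. The threshold of 50 comes from the example in the schedule; the function names `summarizeFacet` and `groupByFirstLetter` and the shape of the returned objects are assumptions made for illustration.

```javascript
// Threshold taken from the example in the schedule; purely illustrative.
const MAX_VALUES_SHOWN = 50;

// breakdown: array of { value, count } objects for one field.
// If there are too many distinct values, keep only the most popular ones.
function summarizeFacet(breakdown, maxShown = MAX_VALUES_SHOWN) {
  if (breakdown.length <= maxShown) {
    return { mode: "all", values: breakdown, hidden: 0 };
  }
  const sorted = breakdown.slice().sort((a, b) => b.count - a.count);
  return {
    mode: "top",
    values: sorted.slice(0, maxShown),
    hidden: breakdown.length - maxShown
  };
}

// Alternative: subdivide the values by the first letter of their name.
function groupByFirstLetter(breakdown) {
  const groups = {};
  for (const row of breakdown) {
    const letter = row.value.charAt(0).toUpperCase() || "?";
    (groups[letter] = groups[letter] || []).push(row);
  }
  return groups;
}
```

Which of the two strategies suits a given field is a user-interface question, which is why the schedule anticipates feedback from interface experts at this stage.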

Throughout the 6 months, there will be an iterative approach to development in which the entire code base and documentation are constantly improved, based on feedback and bug reports. That is a big part of the reason why the first set of code would be released early in the process.

Tools, technologies, and techniques

The code will be created using a combination of HTML, JavaScript and CSS. It is currently planned that the JavaScript will use a combination of the libraries jQuery and AngularJS, as well as Web SQL and IndexedDB (depending on the browser used). LocalStorage may be supported as well, to allow offline viewing. And it is planned that the final HTML and JavaScript code will be compatible with the PhoneGap application, to enable such web apps to also be seamlessly converted into mobile apps on iOS and Android devices.
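Choosing between the two storage systems "depending on the browser used" would typically be done with feature detection, as in this minimal sketch. `indexedDB` and `openDatabase` are the standard browser entry points for IndexedDB and Web SQL respectively; the `chooseStorage` function, its return labels, and the choice to prefer IndexedDB when both exist are assumptions of this example.

```javascript
// Pick a storage backend from what the given global object exposes.
// The returned labels are this sketch's own, not framework constants.
function chooseStorage(globalObject) {
  if (globalObject && globalObject.indexedDB) {
    return "indexeddb";
  }
  if (globalObject && typeof globalObject.openDatabase === "function") {
    return "websql";
  }
  return "none";
}

// In a browser this would be called as chooseStorage(window).
```

Passing the global object in as a parameter keeps the check easy to test outside a browser.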

Budget:

Total amount requested

$30,000.

Budget breakdown

This amount will be used to fund a six-month development effort. It will probably include hiring one or more additional developers, once the basic code design and a working prototype are created. It will probably also include the temporary hiring of a user-interface expert to help with improving the look-and-feel of the application. It may also include travel to MediaWiki- or Wikimedia-related events, or travel to the San Francisco office, if any of these are deemed desirable.

It's important to note that I am willing to take on this project for a lower amount than $30,000, if this amount is considered too high for some reason - I would rather work on a partial project than nothing at all. If the grant is for a lower amount, it simply means that I (and anyone else hired for the project) won't be able to devote as much time to it; so the full set of features may not get implemented by the end of the grant period. I think a minimum amount of money for which I'd take this project on is $15,000 - and I hope I'm not shooting myself in the foot by saying that. :)

Intended impact:

Target audience

I hope it's not an exaggeration to say that, if this project is successful, it could be of interest to literally anyone in the world. (!)

Fit with strategy

This project could fall under a number of categories:

  • Improving Quality: Aggregated information of the kind that this framework's apps are meant to show is knowledge that many people could find useful.
  • Increasing Reach: Apps created with this framework could work in an offline, mobile way, usable by people with limited internet access.
  • Increasing Participation: Seeing these sorts of aggregated listings could help to pinpoint where there are gaps in both Wikipedia and Wikidata.

Sustainability

If this project is successful, it could coalesce into a standalone open-source project with a community of its own, in the manner of Semantic MediaWiki. Or it could be adopted by one or more existing communities, such as the MediaWiki or Semantic MediaWiki communities (or the community around Wikidata and the software that powers it, Wikibase, once such a community exists). The fact that this sort of framework could be usable by any MediaWiki installation makes it potentially quite useful for business, which is a very important factor in guaranteeing the longevity of the software.

Measures of success

The measure of success for this software will of course be its usage: how many "apps" get created with it, how many viewers/users those apps get, and the level of development activity around the software once it is released.

Here are some very rough ideas about usage of the software, at the end of six months, that would constitute success, in my opinion:

  • Software fully developed and available on a code-sharing site such as GitHub.
  • At least 10 apps created by others that make use of Wikipedia/Wikidata data.
  • At least 3 such apps available as true mobile apps.
  • The software in use in conjunction with at least 25 MediaWiki wikis, public and private.
  • A community of users for this software formed, that communicates on a regular basis on either its own mailing list or on one of the existing mailing lists, such as the main MediaWiki or Semantic MediaWiki mailing lists.

If the grant total comes out to less than the full requested amount of $30,000, then it's hard to say what the measures of success would be for each potential grant amount. For simplicity's sake, feel free to simply "prorate" the numbers specified here - for instance, if the grant amount is $20,000 instead, then the measure of success would be at least 2 apps available as true mobile apps instead, etc. I assume that even a lower amount would still lead to a set of usable software by the end - though the quality of the software may vary substantially.

Participant(s)

Yaron Koren - I have significant experience with Wikipedia, MediaWiki, designing data-focused software, and software development in general. I have been doing professional software development for 15 years. Since 2007, I have been a major contributor to the Semantic MediaWiki project. I have created over 10 MediaWiki extensions, and helped supervise the creation of about 10 more. My most well-known extension is Semantic Forms, which is in use on over 250 active public wikis. Another extension of mine that's very relevant to this task is Semantic Drilldown, which provides a drill-down/faceted browsing interface for the data stored by Semantic MediaWiki. I have also run WikiWorks, a MediaWiki consulting company, since 2009. I created and help to run Referata, a MediaWiki-based wiki farm. And in November 2012 I released Working with MediaWiki, a MediaWiki reference book. Finally, I should note that, through WikiWorks specifically, and the Semantic MediaWiki community in general, I have access to a deep pool of developers who could potentially help with this project, both during and after the grant period.


Part 3: Community Discussion

Discussion

Community Notification:

Please paste a link to where the relevant communities have been notified of this proposal, and to any other relevant community discussions, here.

Endorsements:

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

  • I believe that this is an ambitious project, providing a well-elaborated plan and scope. Since it will allow for a huge impact on accessing the information collected and authored within wikis, it is of great value both to the community and to people outside the community. Not to forget, it will increase the appreciation for the authors' work since it makes it easier to benefit from it. This is what information sharing is about. WOW! --[[kgh]] (talk) 20:00, 13 February 2013 (UTC)
  • This project would go a long way toward increasing the ease of reuse and impact of Wikipedia type data. I would be especially excited about the interfaces and opportunities it creates. Vid (talk) 23:58, 13 February 2013 (UTC)
  • If I had to pick one thing I admire about Yaron it would be his ability to create the products people really have use for. Given his experience in both the MediaWiki and the structured data fields, he seems like the ideal person to tackle the challenges outlined in this proposal. --Jeroen De Dauw (talk) 15:43, 14 February 2013 (UTC)
  • I'd be happy to see the described project realized. Based on Yaron's deep involvement with Semantic MediaWiki and MediaWiki over the years, I would have trouble finding many people who would be better suited to tackle this task, and to produce an interesting result. --denny (talk) 14:24, 19 February 2013 (UTC)
  • Really a witty idea to deal with the navigation of the huge data stored in Wikipedia(s) and wikis. I am sure that Yaron, with his experience in design and development of good extensions can make it possible. --Dvdgmz (talk) 23:10, 21 February 2013 (UTC)
  • The idea of an app sounds wonderful, apps are the present not just websites; I would really like to see a demo of how this would look. Yaron has always been thinking ahead in usability, a fine example is SemanticForms--Nischayn22 (talk) 03:45, 25 February 2013 (UTC)
  • Yaron did not exaggerate that the impact of such a project will truly be a big paradigm shift. Having community-contributed data repurposed in the proposed data browser opens up all kinds of possibilities for the community to create special-purpose apps almost on demand. Most apps nowadays are nothing but specialized containers/interfaces for domain-specific data. Building an app, however, is limited to developers. With his proposal, Yaron democratizes this to "anyone in the world!" I wholeheartedly endorse and support this proposal. --Jnatividad (talk) 21:13, 8 March 2013 (UTC)