Research:Ideas/Public query interface for Labs

From Meta, a Wikimedia project coordination wiki


This page documents a proposed research project.
Information may be incomplete and may change before the project starts.

The MySQL databases available on Labs hold a sanitized, up-to-date copy of all the wikis' databases. This is incredibly useful for researchers. However, to access them right now, researchers have to get a Toollabs account, set up an SSH key, understand how SSH (or SSH tunneling) works, and only then write their queries. Furthermore, the queries are not shareable, and there is no convenient way to view past queries.

The goal of this project is to give researchers an easy-to-use, audited, and secure way to run queries against the Labs databases. Two requirements are mandatory: users must agree to Labs' ToS, and the system must prevent denial-of-service attacks against the databases.

Requirements

  1. Users should be auditable: repeat offenders can be banned by anyone with appropriate access to the existing auth infrastructure.
  2. Queries should not DoS the databases.
  3. SQL queries used should be public and shareable, so other people can repeat / improve on them.
  4. The web interface should be easy to use for researchers who know only SQL.

Proposed solution

  1. Build a simple web interface for researchers to execute SQL queries.
  2. Require users to have a Wikitech user account with the shell right. Abusive users can be stopped by revoking this right or by blocking the account. The shell right is currently granted manually. This also requires users to agree to Labs' ToS.
  3. The code that runs the queries will put several restrictions in place to prevent DoSing:
    1. Aggressively kill queries that take a long time. Researchers who want to run queries that take longer can always get a Toollabs account.
    2. Put in place limits on LIMIT clauses (and ensure that all SELECTs have one)
  4. Make all queries public and shareable via a URL (similar to jsbin). Other users can 'fork' a query, modify it for their own purposes, and run it again.
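
The LIMIT restriction in item 3 could be sketched roughly as follows. This is only an illustration, not a proposed implementation: the function name `enforce_limit` and the cap `MAX_ROWS = 1000` are assumptions made for this example, and the regular expression only recognises the simple trailing `LIMIT n` form. A real service would need a proper SQL parser to deal with subqueries, `LIMIT offset, n`, comments, and multiple statements, and would still have to kill long-running queries separately.

```python
import re

# Hypothetical cap on returned rows; the proposal does not specify a number.
MAX_ROWS = 1000

# Matches a simple trailing "LIMIT n" (case-insensitive).
_LIMIT_RE = re.compile(r"\blimit\s+(\d+)\s*$", re.IGNORECASE)


def enforce_limit(sql: str, max_rows: int = MAX_ROWS) -> str:
    """Return the query with a trailing LIMIT no larger than max_rows.

    Only single SELECT statements are accepted; anything else is rejected.
    """
    # Drop surrounding whitespace and a trailing semicolon, if any.
    stripped = sql.strip().rstrip(";").rstrip()
    if not stripped.lower().startswith("select"):
        raise ValueError("only SELECT statements are accepted")

    match = _LIMIT_RE.search(stripped)
    if match is None:
        # No LIMIT present: append the maximum.
        return f"{stripped} LIMIT {max_rows}"

    # LIMIT present: keep it if small enough, otherwise clamp it.
    capped = min(int(match.group(1)), max_rows)
    return f"{stripped[:match.start()]}LIMIT {capped}"


if __name__ == "__main__":
    print(enforce_limit("SELECT page_id FROM page"))
    print(enforce_limit("SELECT page_id FROM page LIMIT 50000;"))
```

Rewriting the query before it is sent means an oversized request is clamped silently; whether to clamp or to reject such a query outright is a design choice this page leaves open.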

This should make both Researchers and Opsen happy, I believe :)

Support needed

  1. If you are a researcher and think this project will be useful to you or other researchers you know, please show your support! Also tell us which features you think are important.
  2. If you're a developer and want to work on a nice, distributed project that'll be used by a lot of people, sign up to help write the code!

Endorsements

  • This system would provide critical resources for Wikipedians, Developers and new Wiki-researchers. It will allow them to get their hands dirty with data in an extremely lightweight way. I fully support this work and look forward to seeing the project be successful. I'll certainly be an alpha tester and I'll personally work to advertise and help others use the tool. --Halfak (WMF) (talk) 18:09, 24 July 2014 (UTC)
  • I'm super excited about this proposal! As a researcher (and advisor of graduate students) who lacks amazing SQL skills, I often find myself seeking a sandbox/simplified resource for running queries against the Labs dbs. I support this. Aaronshaw (talk) 12:03, 6 August 2014 (UTC)
  • This will be a very useful tool. Helder 21:22, 7 August 2014 (UTC)
  • A very great idea. Southparkfan 11:11, 8 August 2014 (UTC)
