General User Survey/Implementation Issues

From Meta, a Wikimedia project coordination wiki

This page discusses the technical issues involved in implementing the General User Survey Questionnaire, and in storing and processing its results.

Multilingual issues

A MediaWiki-wide survey would require as many language editions as possible. 200+ versions is impracticable, but I expect that at least the 10 to 25 largest projects would help out with translations.

This requires

  • Good translation coordination (recruiting volunteers, deciding on a final questionnaire before the translation starts in earnest)
  • The simplest possible way to add and update translations (probably by grouping all texts in a separate module, which can be distributed for translation, much like the MediaWiki language modules)
  • A questionnaire layout that does not pose restrictions on text size, as this will vary per language
  • Timely testing of each language module
  • Ideally a way to present some choices differently per language (e.g. a list of country names to select from, sorted alphabetically in the language of choice)
  • Encoding of all results in a language-independent manner. E.g. the script does not store 'Easy' or 'Unsatisfactory' but numbers that signify answers to the w:Likert scale type questions. For answers to questions like 'In which country do you live?' we could prefix a fixed set of valid answers with numbers. Thus a drop-down box could list:
314 - Afghanistan
215 - Albania
453 - Algeria
etc

To ease validation of results the complete entry is stored (e.g. '453 - Algeria'), but processing of the results can be based on the number prefix only. In different languages the list of countries can be sorted in a different order, but the same number prefixes are used. The numbers are deliberately arbitrary (random?) three-digit numbers: a later edition of the survey can then reuse the same numbers even when the list of countries has changed.
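The numeric-prefix idea can be sketched as follows. This is only an illustration in Python (the page proposes PHP for the real implementation). The three codes come from the example list above; the `LABELS` table, the German names and the function names are assumptions made up for this sketch. Sorting here is a plain string sort; a real implementation would need locale-aware collation.

```python
# Hypothetical per-language labels, keyed by the stable numeric code.
# The codes 314, 215 and 453 are the ones used in the example list.
LABELS = {
    "en": {314: "Afghanistan", 215: "Albania", 453: "Algeria"},
    "de": {314: "Afghanistan", 215: "Albanien", 453: "Algerien"},
}


def dropdown_options(lang):
    """Return 'code - name' entries, sorted by the name in the given language."""
    names = LABELS[lang]
    ordered = sorted(names.items(), key=lambda item: item[1])
    return [f"{code} - {name}" for code, name in ordered]


def parse_answer(stored):
    """Extract the numeric prefix from a stored entry such as '453 - Algeria'."""
    prefix, _, _ = stored.partition(" - ")
    return int(prefix)
```

Because processing only looks at the prefix, `parse_answer` gives the same code for an answer regardless of the language in which it was submitted.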

Programming language

The first General User Survey might serve as a prototype for other surveys, which could later be integrated into the MediaWiki code base. Thus it seems wise to code in MediaWiki's language of choice: PHP.

Validation of input

Some questions can be left unanswered. Some cannot (?). Wherever validation is required, the script should mark erroneous or missing input in red and return the form to the user with all user input preserved, so that the user does not have to repeat input that was already valid (this still happens on some sites). Clear, language-dependent error messages will have to be presented that explain what the user is expected to do, preferably immediately above or below the question that the message refers to. Multiple validation errors can be presented at the same time, all marked clearly in red.
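A minimal sketch of this behaviour, in Python for illustration only. The question ids, the set of required questions, the 1 to 5 answer range and the message texts are all assumptions, not part of any agreed questionnaire. The function collects every error in one pass and hands back the submitted values untouched so that the form can be redisplayed.

```python
# Hypothetical questionnaire definition: allowed numeric answers per question.
ALLOWED = {"q1": {1, 2, 3, 4, 5}, "q2": {1, 2, 3, 4, 5}}
REQUIRED = {"q1"}  # assumed: q1 is mandatory, q2 is optional

# Hypothetical per-language error messages.
MESSAGES = {
    "en": {
        "missing": "Please answer this question.",
        "invalid": "Please choose one of the listed options.",
    },
}


def validate(form, lang="en"):
    """Return (errors, preserved_input).

    errors maps a question id to a localized message, so each message can be
    shown next to its question; preserved_input is a copy of what the user sent.
    """
    errors = {}
    for qid, allowed in ALLOWED.items():
        raw = form.get(qid, "").strip()
        if not raw:
            if qid in REQUIRED:
                errors[qid] = MESSAGES[lang]["missing"]
            continue
        if not raw.isdecimal() or int(raw) not in allowed:
            errors[qid] = MESSAGES[lang]["invalid"]
    return errors, dict(form)
```

An empty `errors` dictionary means the submission can be stored; otherwise the form is rendered again from `preserved_input` with the flagged questions marked.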

Encoding of answers

See also 'Multilingual issues' above. As far as possible, all answers are encoded in a numeric format. Questions are also encoded numerically, so a later survey can add, remove or rephrase questions and still refer to them by the same number.

User comments

Here are two ways to let the user add comments:

  • The form contains a button per question that shows a (popup?) form where users can explain their answer or comment otherwise. The comments are then stored anonymously in a separate comments table, prefixed by question number. Others can browse this table per question.
  • Much simpler to implement and in line with the wiki approach, but less anonymous: each question in the form is accompanied by a static URL to a question-specific meta page where that question can be discussed.

Storage of results

Results are stored in a dedicated MySQL table, with one row per user (or user/question pair). A new submission by the same user overwrites the first, or is not allowed (?).

Accessibility of results

Most researchers seem to favour CSV format for publishing public, aggregated data. The original database(s) that store the raw user input need to be treated similarly to the MediaWiki user table, namely only accessible with a database password on a secure server, and XML-dumped to the folder 'private'.

Anonymity of results

All data that will be published for processing by outside researchers need to be thoroughly anonymised. This may imply that some data are published with less granularity than that with which they are stored. E.g. a unique country code is stored, but the public CSV extract shows only the continent (just an example, to be discussed). Similarly, year of birth may be stored, but presented in the public CSV file as age rounded to the nearest decade.
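The two coarsening examples could look like this in a Python sketch. The continent mapping reuses the example country codes from 'Multilingual issues'; the column names, function names and half-up rounding rule are assumptions for illustration, and the actual level of granularity is still to be discussed.

```python
import csv
import io

# Hypothetical mapping from stored country code to published continent.
CONTINENT = {314: "Asia", 215: "Europe", 453: "Africa"}


def public_row(country_code, birth_year, survey_year):
    """Reduce one stored record to the coarser values meant for publication."""
    age = survey_year - birth_year
    age_decade = (age + 5) // 10 * 10  # nearest decade, halves rounded up
    return {
        "continent": CONTINENT.get(country_code, "Unknown"),
        "age_decade": age_decade,
    }


def to_csv(rows):
    """Serialize the public rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["continent", "age_decade"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Only the output of `public_row` would ever reach the public file; the exact country code and year of birth stay in the private table.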

The table itself could contain the exact user name so that confidential stats scripts could do some fact finding like retrieving the number of edits from the projects (and publishing this again in a generalized fashion).

Authentication

The survey will be much more useful when restrictions are imposed on spamming it. In late 2006 MediaWiki will move to single user sign-on, so that each registered user is unique across all projects.

The results would benefit from user validation, either

  • by referring to a static copy of the MySQL user table, with sensitive and unneeded info removed (mail address etc)
  • or by invoking a function or script, to be supplied by the MediaWiki dev team, that accesses the live user table and returns 'OK to continue' or not, plus a unique id for this user (possibly different from the already existing user number), so that each user can only submit the form once (or a second submission overwrites the previous results from the same user).
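One possible shape for such a check, sketched in Python. Nothing here is an existing MediaWiki function: `check_user`, `survey_id`, the secret key and the use of a keyed hash (HMAC-SHA256) to derive an id that differs from the user number are all assumptions. The set of known users stands in for either the static copy or the live user table.

```python
import hashlib
import hmac

# Hypothetical secret, kept on the survey server only.
SECRET_KEY = b"replace-with-a-private-key"


def survey_id(user_name):
    """Derive a stable id that cannot be mapped back to the user without the key."""
    digest = hmac.new(SECRET_KEY, user_name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def check_user(user_name, known_users, already_submitted):
    """Return (ok_to_continue, unique_id).

    known_users: names found in the user table (static copy or live lookup).
    already_submitted: survey ids that have already sent in the form.
    """
    if user_name not in known_users:
        return False, None
    sid = survey_id(user_name)
    return sid not in already_submitted, sid
```

If a second submission should overwrite the first instead of being refused, the caller can ignore the first return value for known users and simply store the new answers under the same id.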

Database design

It remains to be seen whether results can best be stored:

  1. in a separate record per user per question (normalisation issues involved)
  2. in one record per user with separate fields per question (adding new question means adding field(s) to the table)
  3. in one record per user with all answers in one xml structure in one large text field (flexible but less queryable with sql).
I'd vote for #1 in a table with 4 columns, i.e. user id, question id, answer, comment. Normalising the answer to represent unanswered, true/false or small ints shouldn't be too hard. -Sanbeg 20:55, 26 January 2007 (UTC)
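The four-column layout suggested above might look like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for the MySQL table so that it runs without a server. The table name, column types and helper functions are assumptions, and a second submission is assumed to overwrite the first, which is one of the two open options.

```python
import sqlite3


def open_store():
    """Create an in-memory table with one row per user/question pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE answers (
               user_id     TEXT    NOT NULL,
               question_id INTEGER NOT NULL,
               answer      INTEGER,          -- NULL means unanswered
               comment     TEXT,
               PRIMARY KEY (user_id, question_id)
           )"""
    )
    return conn


def save_answer(conn, user_id, question_id, answer, comment=None):
    """Store an answer; a repeat submission replaces the earlier row."""
    conn.execute(
        "INSERT OR REPLACE INTO answers VALUES (?, ?, ?, ?)",
        (user_id, question_id, answer, comment),
    )


def answer_counts(conn, question_id):
    """Count how often each answer value was given to one question."""
    rows = conn.execute(
        "SELECT answer, COUNT(*) FROM answers WHERE question_id = ? GROUP BY answer",
        (question_id,),
    )
    return dict(rows.fetchall())
```

With this layout, adding a question needs no schema change, and per-question aggregates remain a single SQL query.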