Working with data in Wikimedia and MediaWiki

From Meta, a Wikimedia project coordination wiki
This course has ended. The Wikimedia Labs instances mentioned here and in the slides no longer exist.

Working with data in Wikimedia and MediaWiki is a course taught by Niklas Laxström and Susanna Ånäs.

Information

Place: Language Technology, Department of Modern Languages, University of Helsinki, Helsinki, Finland
Time: September–December 2016
Course info and sign-up: WebOodi

5.9. Wiki

Slides

Reading

Home assignment

Submission: send your written replies to niklas.laxstrom AT helsinki.fi using subject wmw-01 before Monday 12.9.

Content organization

  1. Go to Special:AllPages
  2. Go over all the non-talk namespaces from the namespace dropdown
  3. Open some pages from each namespace to see what kind of content there is

What do you think each namespace is used for? Do you see other patterns in the way pages are named besides the namespace? Is there anything special about the name of the page Special:AllPages itself? Write down your observations and thoughts.
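
As a starting point for thinking about page names, here is a minimal Python sketch of how a title can be split into a namespace prefix and a page name. The list of namespaces is an assumption (a few standard MediaWiki ones), and real MediaWiki also normalizes case, underscores, and localized namespace names, which this sketch ignores.

```python
# A few standard MediaWiki namespaces; a real wiki defines more (assumption).
KNOWN_NAMESPACES = ("Special", "Template", "Help", "Category", "File", "User", "Talk")


def split_title(title, known_namespaces=KNOWN_NAMESPACES):
    """Split 'Namespace:Page' into (namespace, page name).

    Pages without a recognized prefix belong to the main namespace,
    represented here by an empty string.
    """
    prefix, separator, rest = title.partition(":")
    if separator and prefix in known_namespaces:
        return prefix, rest
    return "", title


print(split_title("Special:AllPages"))
print(split_title("Star Wars: Episode I"))
```

Note that a colon alone does not make a namespace: only prefixes the wiki knows about are treated as one.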

Basic wiki

I have created an uncustomized wiki installation. Compare it to this wiki and document what differences you see in the appearance and functionality. For example, try editing pages (but don't save anything). To find more differences, compare the installed extensions listed on Special:Version of this wiki with those on Special:Version of the uncustomized wiki.

12.9. MediaWiki

Slides

Reading

Home assignment

  1. Pick two unique names, hereafter called A and B. You can use Special:Random as an inspiration.
  2. Go to the previously empty wiki of last week's assignment.
  3. It is not necessary to register in this wiki to create pages. Create the page Template:A with the contents "This is the _ of the page _", replacing each underscore with the appropriate wikicode: the first should output the value of the first unnamed parameter, and the second should output the name of the current page. See the help links in the reading section or the slides.
  4. Create page B with any creative content, such as "Hello world!". Edit page B again and use template A twice: place {{A|beginning}} at the beginning and {{A|end}} at the end of the page, then save your edits.
  5. Document how to use Template:A, placing the documentation inside <noinclude> tags.

Make sure the page looks okay, for example that the text does not run together. Send the link to page B by email per instructions above.
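
To illustrate what happens when a template is used on a page, the following toy Python sketch imitates three MediaWiki behaviours: <noinclude> blocks are dropped on transclusion, {{{1}}} is replaced by the first unnamed parameter, and {{PAGENAME}} outputs the title of the page being rendered. It is a simplified model for illustration only, not how the MediaWiki parser is implemented, and the template text is a hypothetical example.

```python
import re


def expand_template(template_source, page_name, *args):
    """Toy model of transcluding a template into the page `page_name`."""
    # <noinclude> content is shown on the template page only, not where it is used.
    text = re.sub(r"<noinclude>.*?</noinclude>", "", template_source, flags=re.DOTALL)

    # Replace {{{1}}}, {{{2}}}, ... with the matching unnamed parameter.
    # A parameter that was not supplied is left as-is, as MediaWiki does.
    def positional(match):
        index = int(match.group(1)) - 1
        return args[index] if index < len(args) else match.group(0)

    text = re.sub(r"\{\{\{(\d+)\}\}\}", positional, text)

    # {{PAGENAME}} outputs the name of the page being rendered.
    return text.replace("{{PAGENAME}}", page_name)


# Hypothetical template source, including its own documentation.
template_a = (
    "This is the {{{1}}} of the page {{PAGENAME}}."
    "<noinclude>\nUsage documentation goes here.\n</noinclude>"
)

print(expand_template(template_a, "B", "beginning"))
```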

Submission: send your answers to niklas.laxstrom AT helsinki.fi using subject wmw-02 before Monday 19.9.

19.9. MediaWiki extensions

Slides

Reading

Home assignment

If you are sure that you are going to install MediaWiki on your own, you can skip these steps, but do send an email to inform me that you are doing it.

  1. Familiarize yourself with Wikimedia Labs terms of use
  2. Create a Wikimedia Labs account (also known as Wikitech account)
  3. Create an SSH key if necessary and set it up for Wikimedia Labs

Submission: send your account name to niklas.laxstrom AT helsinki.fi using subject wmw-03 before Monday 3.10.

26.9. Wikimedia projects

Slides

Reading

Home assignment

  1. Define a list, map or a timeline for a topic. Choose a topic that could illustrate a Wikipedia article.
  2. Make a Wikidata query that returns all necessary information.
  3. Include dates, locations (points or areas) and images in the query.
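
As a rough illustration of such a query, the Python sketch below assembles a SPARQL query with a date, a location, and an image, and builds a link to the Wikidata Query Service endpoint. The topic (cities in Finland) and the function names are assumptions chosen for the example; the property IDs used are P31 (instance of), P17 (country), P571 (inception), P625 (coordinate location), and P18 (image).

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(instance_of="Q515", country="Q33", limit=50):
    """Return a SPARQL query for items of a class in a country.

    Defaults are an example topic: Q515 (city) in Q33 (Finland).
    Dates, coordinates, and images are optional so that items
    missing one of them are still listed.
    """
    return f"""
SELECT ?item ?itemLabel ?inception ?coord ?image WHERE {{
  ?item wdt:P31 wd:{instance_of} ;
        wdt:P17 wd:{country} .
  OPTIONAL {{ ?item wdt:P571 ?inception . }}
  OPTIONAL {{ ?item wdt:P625 ?coord . }}
  OPTIONAL {{ ?item wdt:P18 ?image . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "fi,en" . }}
}}
LIMIT {limit}
""".strip()


def query_url(sparql):
    """Build a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


print(build_query())
print(query_url(build_query()))
```

Using OPTIONAL for each extra value is one possible design choice; without it, items lacking an image or a date would be silently dropped from the results.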

Submission: send a link to your query to niklas.laxstrom AT helsinki.fi using subject wmw-04 before Monday 3.10.

3.10. Lists, maps and timelines

Slides

Reading

Home assignment

  • Create or polish your map, timeline or list
  • Fix the data, if needed

Listeria list

  1. Create a Listeria list in your preferred wiki
  2. Add parameters
    • Use only one SELECT parameter: ?item
    • Listeria will take care of language, multiple values, grouping etc.
  3. Insert the list into your preferred wiki

Histropedia timeline

  1. Create a Histropedia timeline of your SPARQL query
  2. Add parameters
    • For title, URL, dates, image, etc., enter the name of the corresponding query variable without the question mark. For text fields, use a variable holding the textual representation, not the ID.
    • You can group items based on one parameter.
  3. Insert a link to your preferred wiki

Kartographer <maplink> or <mapframe> map with SPARQL query and geoshapes from OpenStreetMap

Examples

  1. Where is Finland?
  2. Helsinki neighbourhoods
  3. Municipalities of Finland

Home assignment option

  • Create a map based on Finnish municipalities or neighbourhoods of Helsinki
  • Use parameters ?img, ?title, ?description, ?link and ?fill in your SPARQL query
  • All municipality geoshapes are needed for this to work, so take part in the talkoot (communal work effort) described below!
  • It may take up to 2 days for the geoshapes to appear in the map

Additional talkoot for everyone!

  • Create an account in OpenStreetMap
  • In OSM, add 15 Wikidata IDs to OpenStreetMap features for Finnish municipalities. See this blog post for help.
    • Log in
    • Go to edit mode
    • Search for your municipality
    • Select the administrative unit (an area) from the list if there are several options
    • Add the field "Wikipedia". Use any language and select the name of the municipality. The Wikidata ID follows automatically.
    • If the Wikipedia article exists but no Wikidata ID is filled in, select the Wikipedia article again and the Wikidata ID appears.
    • Remember to save
  • For those who have already completed their first 15 and those who have not started, select your set from the sets below, approx. 15 items :)
Cities | Reserved by | Completed!
Akaa–Evijärvi | taken | done
Finström–Hattula | Kim | done
Hausjärvi–Iisalmi | taken | done
Iitti–Joroinen | Virpi | done
Joutsa–Kangasniemi | taken | done
Kankaanpää–Keminmaa | Julia | done
Kemiönsaari–Konnevesi | |
Kontiolahti–Kyyjärvi | |
Kärkölä–Leppävirta | Anna | done
Lestijärvi–Maalahti | Ville | done
Maarianhamina–Mänttä-Vilppula | Niklas | done
Mäntyharju–Padasjoki | |
Paimio–Pirkkala | Sabine | done
Polvijärvi–Pyhäranta | Kim | done
Pälkäne–Ruokolahti | Susanna | done
Ruovesi–Siikajoki | Susanna | done
Siikalatva–Sysmä | Susanna | done
Säkylä–Tuusula | Susanna | done
Tyrnävä–Vesanto | Susanna | done
Vesilahti–Äänekoski | Susanna | done

Submission: send a link to your list, timeline or map to niklas.laxstrom AT helsinki.fi using subject wmw-05 before Monday 10.10.

Comments

  • Data may be modeled differently in different cases
  • There are many ways to deal with duplicates
  • Good for educational purposes and for visual learners, especially in history teaching, also at high school level

10.10. Extracting content

Slides

Reading

Home assignment

Produce a plain text dump of Finnish Wikipedia articles having a name that starts with Abe. Do not include redirects.

Place the extracted text of each article in a separate file named after the article. Remove characters such as (, ), or & that can be problematic in file names. Use UTF-8 encoding.
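
One possible way to handle the file naming is sketched below in Python. The exact set of characters to remove and the .txt suffix are assumptions; adjust them to the problematic cases you actually encounter.

```python
import re
from pathlib import Path


def safe_filename(title):
    """Turn an article title into a file name without problematic characters."""
    # Drop characters that cause trouble in shells or on common file systems.
    name = re.sub(r"[()&/\\:*?\"<>|']", "", title)
    # Collapse the remaining whitespace into single underscores.
    name = re.sub(r"\s+", "_", name.strip())
    return name + ".txt"


def write_article(directory, title, text):
    """Write the extracted text of one article to its own UTF-8 file."""
    path = Path(directory) / safe_filename(title)
    path.write_text(text, encoding="utf-8")
    return path


print(safe_filename("Abel (Raamattu)"))
```

Non-ASCII letters such as ä and ö are kept on purpose, since the files are written and named in UTF-8.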

You can use the database dumps or the API. Use the latest version of each article available from your source.
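
If you choose the API, the list of matching articles can be requested with the allpages list module. The Python sketch below only builds the request URL and reads titles from a response; the function names and the sample response are made up for illustration, and a complete script would also have to follow the continuation values the API returns when there are more results.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://fi.wikipedia.org/w/api.php"


def allpages_url(prefix="Abe", limit=500):
    """Build a URL listing non-redirect articles whose title starts with prefix."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apprefix": prefix,
        "apnamespace": 0,                 # main (article) namespace only
        "apfilterredir": "nonredirects",  # leave out redirects
        "aplimit": limit,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def titles_from_response(response):
    """Extract page titles from a decoded JSON response of list=allpages."""
    return [page["title"] for page in response["query"]["allpages"]]


# Hypothetical response data, shaped like the API's JSON output.
sample_response = {
    "query": {
        "allpages": [
            {"pageid": 1, "ns": 0, "title": "Abel"},
            {"pageid": 2, "ns": 0, "title": "Aberdeen"},
        ]
    }
}

print(allpages_url())
print(titles_from_response(sample_response))
```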

You can use mediawiki-utilities and/or other libraries (for example BeautifulSoup) or programming languages. The goal is to extract sentences from the articles. All wikitext mark-up or HTML mark-up should be removed as much as possible, as well as headings, infoboxes, citations, tables, etc.
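
As an example of what removing mark-up can look like, here is a rough Python sketch based on regular expressions only. It is deliberately crude and rests on assumptions: it handles nested templates but not nested links inside image captions, and the namespace names cover only the common Finnish and English prefixes. A dedicated parser library would be more robust.

```python
import re


def strip_wikitext(text):
    """Crudely reduce wikitext to plain sentences."""
    # Citations: self-closing <ref ... /> first, then <ref>...</ref> pairs.
    text = re.sub(r"<ref[^>/]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Tables: {| ... |}
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.DOTALL)
    # Templates such as infoboxes: remove innermost first until none are left.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # File and category links carry no sentence content.
    text = re.sub(
        r"\[\[(?:Tiedosto|Kuva|File|Image|Luokka|Category):[^\[\]]*\]\]", "", text
    )
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Headings such as == Katso myös ==
    text = re.sub(r"^=+.*?=+[ \t]*$", "", text, flags=re.MULTILINE)
    # Bold and italic quote marks.
    text = re.sub(r"'{2,}", "", text)
    # Collapse the blank lines left behind.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


# Hypothetical article source used only to demonstrate the function.
sample = (
    "{{Infobox|name=Abel}}\n"
    "'''Abel''' oli [[Aatami ja Eeva|Aatamin]] poika.<ref>Lähde</ref>\n"
    "== Katso myös ==\n"
    "[[Luokka:Raamattu]]"
)

print(strip_wikitext(sample))
```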

If you decide to use the dumps, you can do this exercise on prugna.wmwcourse.eqiad.wmflabs (how to access), where the dump file is under /data and mediawiki-utilities and BeautifulSoup4 are already installed. You need to use the python3 command to run your script. Since just iterating over the dump takes more than 5 minutes, consider splitting your script into two parts: first extract the relevant pages with their content, then clean up the output.
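
The first of the two parts could look roughly like the Python sketch below. It uses only the standard library's streaming XML parser instead of mediawiki-utilities, which is an assumption made to keep the example self-contained; the embedded sample stands in for the real dump file, whose name is not given here.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hypothetical stand-in for a dump file, used only to demonstrate the function.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Abel</title><ns>0</ns>
    <revision><text>'''Abel''' on nimi.</text></revision></page>
  <page><title>Abe</title><ns>0</ns><redirect title="Abel" />
    <revision><text>#OHJAUS [[Abel]]</text></revision></page>
  <page><title>Helsinki</title><ns>0</ns>
    <revision><text>Suomen pääkaupunki.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_file, prefix):
    """Yield (title, wikitext) for non-redirect main-namespace pages."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        title = fields["title"].text or ""
        is_article = fields["ns"].text == "0" and "redirect" not in fields
        if is_article and title.startswith(prefix):
            text = ""
            for node in fields["revision"]:
                if local_name(node.tag) == "text":
                    text = node.text or ""
            yield title, text
        elem.clear()  # release the finished page to keep memory use low


pages = list(iter_articles(io.BytesIO(SAMPLE.encode("utf-8")), "Abe"))
print(pages)
```

For a real dump, pass an opened file object instead of the in-memory sample (for a bz2-compressed dump, one opened with bz2.open in binary mode). Saving the yielded pages to disk lets the slower clean-up step be rerun without iterating over the dump again.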

Write down notes about the problematic cases that you encounter. Finally, give an estimate of how long it would take to produce this kind of dump for all of Finnish Wikipedia.

Submission: send your notes, script, and text files in an archive to niklas.laxstrom AT helsinki.fi using subject wmw-06 before Monday 17.10.

17.10. Wikimania

There is no lecture on 17.10.

Wikimania is the largest annual conference of the Wikimedia movement. It has presentations on both technical and social topics, and it provides a window into what is happening in the movement.

Home assignment

Watch 2 or 3 presentations from Wikimania 2016 based on your interests. Summarize each presentation and what you learned in a few paragraphs. Be prepared to share highlights with others in the next lecture.

Submission: send your summaries to niklas.laxstrom AT helsinki.fi using subject wmw-07 before Monday 31.10.

24.10. Period break

There is no lecture on 24.10.

31.10. Semantic MediaWiki

Slides

Reading

Home assignment

You have received the name of your Vagrant wiki in the lecture or via email. If you have not, contact Niklas.

  1. Check that http://wmwcourse-name.wmflabs.org has a working wiki.
  2. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh. See wikitech:Help:Getting_Started#Project_Instances for how to do this, if you haven't already.
  3. Enable the semanticmediawiki role with cd /srv/mediawiki-vagrant; vagrant roles enable semanticmediawiki && vagrant provision.
  4. Log in to your wiki using the admin account and change the password.
  5. Add some pages with semantic annotations to your wiki using the template approach. For example, countries and capitals, but feel free to use your imagination.
  6. Create a page with a semantic query ({{#ask:}}) that displays some data from those pages. For example, countries with their capitals and population in descending order.
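
To show the shape of such a query, the small Python helper below assembles the wikitext of an {{#ask:}} call from its parts: a condition, the properties to print, and sorting options. The category and property names (Country, Capital, Population) are hypothetical and must match whatever your own templates annotate.

```python
def ask_query(condition, printouts, sort=None, order=None, fmt="table"):
    """Assemble the wikitext of a Semantic MediaWiki inline query."""
    parts = [condition]
    # Each printout is a property name prefixed with a question mark.
    parts += ["?" + name for name in printouts]
    if sort:
        parts.append("sort=" + sort)
    if order:
        parts.append("order=" + order)
    parts.append("format=" + fmt)
    return "{{#ask: " + "\n |".join(parts) + "\n}}"


# Countries with their capitals and population, largest population first.
print(ask_query(
    "[[Category:Country]]",
    ["Capital", "Population"],
    sort="Population",
    order="desc",
))
```

The printed wikitext can be pasted into a wiki page as-is.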

Submission: send a link to your query page to niklas.laxstrom AT helsinki.fi using subject wmw-08 before Monday 7.11.

7.11. Forms

Slides

Reading

Home assignment

Use the same Vagrant wiki as you did last week.

  1. Check that http://wmwcourse-name.wmflabs.org has a working wiki.
  2. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh.
  3. Install Page Forms. It does not have a role yet, so we are going to install it manually.
    1. Go inside your Vagrant virtual machine: cd /srv/mediawiki-vagrant; vagrant ssh
    2. Download the PageForms extension: cd /vagrant/mediawiki/extensions; git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/PageForms
    3. Exit the virtual machine: exit
    4. Register the extension (you can use your favorite editor) by creating a new settings file: nano /srv/mediawiki-vagrant/settings.d/20-pageforms.php with contents:
      <?php
      
      wfLoadExtension( 'PageForms' );
      
    5. Check Special:Version of your wiki to confirm it is installed properly.
  4. Use Special:CreateForm to create a new form
  5. Edit your form page to better suit your input by selecting input types, possible values etc.
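
Besides looking at Special:Version in a browser, the installed extensions can also be listed through the API's siteinfo module. The Python sketch below builds such a request and reads extension names from a response. The /w/api.php path and the sample response are assumptions for illustration; check the actual API address and extension name on your own wiki.

```python
from urllib.parse import urlencode


def siteinfo_url(api_endpoint):
    """Build a URL asking the wiki which extensions are installed."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "extensions",
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


def installed_extensions(response):
    """Return the sorted extension names from a decoded siteinfo response."""
    return sorted(ext["name"] for ext in response["query"]["extensions"])


# Hypothetical response data, shaped like the API's JSON output.
sample_response = {
    "query": {
        "extensions": [
            {"type": "semantic", "name": "Semantic MediaWiki"},
            {"type": "specialpage", "name": "Page Forms"},
        ]
    }
}

# The API path below is an assumption; "name" is the placeholder used above.
print(siteinfo_url("http://wmwcourse-name.wmflabs.org/w/api.php"))
print(installed_extensions(sample_response))
```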

Submission: send a link to your form page to niklas.laxstrom AT helsinki.fi using subject wmw-09 before Monday 14.11.

14.11. Translate extension

Slides

Reading

Home assignment

Use the same Vagrant wiki as you did last week.

  1. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh.
  2. Install the MediaWiki Language Extension Bundle. It has a Vagrant role.
    1. Enable the mleb role and provision the wiki: cd /srv/mediawiki-vagrant; vagrant roles enable mleb; vagrant provision
    2. Add some basic configuration: nano /srv/mediawiki-vagrant/LocalSettings.php with contents:
      $wgGroupPermissions['user']['translate'] = true;
      $wgGroupPermissions['user']['translate-messagereview'] = true;
      $wgGroupPermissions['user']['translate-groupreview'] = true;
      $wgGroupPermissions['user']['pagetranslation'] = true;
      $wgTranslateDocumentationLanguageCode = 'qqq';
      $wgExtraLanguageNames['qqq'] = 'Message documentation';
      
    3. Check Special:Version of your wiki to confirm it is installed properly.
  3. Make your query results page and form translatable. You can use either page translation or unstructured element translation. Remember that some form labels do not support {{int}}, so it is okay to leave those untranslated.
  4. Translate your query results page and form to one language other than English.

Submission: send links to your pages to niklas.laxstrom AT helsinki.fi using subject wmw-10 before Monday 21.11.

21.11. Content Translation & Project work

Slides

Reading

Home assignment

Try Content Translation

  1. Log in to Wikipedia
  2. Go to beta features tab in your preferences and enable content translation
  3. Go to Special:ContentTranslation and do a translation (you don't need to publish)
  4. Write comments answering the following questions:
    1. Did you encounter any bugs or issues during translation?
    2. Compare the actual source article and what you see in the translation tool's source column. What differences are there?
    3. Now that you have tried different kinds of translation tools, what are the benefits and downsides of each tool?
    4. How would you decide which tool to use for different types of content?
    5. If you decide to publish your translation, include a link to the published page

Choose a data set

If you want to do project work, choose a data set. Refer to the slides for what is available.

Submission: send your answers to niklas.laxstrom AT helsinki.fi using subject wmw-11 before Monday 28.11.

28.11. Pywikibot and tips for running MediaWiki

Slides

Reading

Home assignment

  • No home assignment this week.

5.12. Examples on subobjects and custom parser functions

Slides

Reading

Home assignment

  • No home assignment this week either.

12.12. Guest presentation, course summary and life after the course

Slides