What are you trying to achieve?
As part of the annual plan TEC9: Address knowledge gaps, we have implemented the paper Growing Wikipedia Across Languages via Recommendation and want to take the implementation to production. We want to generate recommendations for articles to create and make these recommendations available via a REST API.
What does the implementation look like?
The implementation repository contains a set of scripts (with instructions on how to run them) that generate article recommendations for a given language pair in TSV format. These scripts were developed and tested using pyspark on stat1007. Here you can learn more about how it works.
How does the production pipeline work?
Please see the diagram below.
What are you going to do with the TSV files?
We want to import these TSV files into MySQL and expose the data through the REST API. File sizes range from a few megabytes to about 200 MB per language pair. The fewer articles two Wikipedia language editions have in common, the larger the file.
Tell me more about the MySQL database and the REST API that's using it
How are database tables created?
We have a repository that creates the database tables and imports data into the database. We want to create an Oozie job to automate these processes.
How is data going to be imported?
Via an Oozie job that imports TSV files from HDFS into MySQL. We'll work with Analytics on an extensible solution that can support other storage systems as well. This setup only needs MySQL, but other storage backends may be needed in the future.
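The import step described above can be pictured with a minimal Python sketch. It is not the Oozie job itself, only an illustration of reading a TSV file in batches and building a parameterized INSERT statement that a DB-API MySQL driver could run with executemany. The function names, the table name and the column names are hypothetical; the real schema lives in the repository mentioned above.

```python
import csv
import io
from itertools import islice


def read_tsv_batches(handle, batch_size=1000):
    """Yield lists of row tuples from a TSV file-like object.

    Reading in batches keeps memory use flat even for files
    of a few hundred megabytes.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    while True:
        batch = [tuple(row) for row in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


def insert_statement(table, columns, placeholder="%s"):
    """Build a parameterized INSERT statement.

    "%s" is the placeholder style commonly used by MySQL DB-API drivers.
    Table and column names are assumed to be trusted identifiers.
    """
    cols = ", ".join(f"`{c}`" for c in columns)
    marks = ", ".join([placeholder] * len(columns))
    return f"INSERT INTO `{table}` ({cols}) VALUES ({marks})"


# Hypothetical two-column file: Wikidata item ID and normalized score.
sample = io.StringIO("Q1\t0.9\nQ2\t0.5\nQ3\t0.1\n")
batches = list(read_tsv_batches(sample, batch_size=2))
sql = insert_statement("article_recommendation_20190115", ["wikidata_id", "score"])
```

Each batch could then be passed, together with the statement, to cursor.executemany on whichever storage backend is in use; only insert_statement would need to change for a different backend.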
How are normalized article scores updated?
To generate a new set of recommendations, we need access to fresh Wikidata dumps. Once we generate new recommendations, we'll import them into MySQL as versioned tables whose names are suffixed with a date. We also have canonical views, each pointing to one version of these tables. The REST API reads from these views.
Creating versioned tables, as opposed to one table with versioned rows, saves space because the data version is stored only in the table name and not in every row. Another upside is that we can drop old versions of the data without locking the database. Backing up specific versions of the recommendations is also easier this way.
Once the new data is in the database, we'll monitor its usage and pay attention to any issues it may have. If no problems are found, we can drop older versions of the tables and only keep the new data. In case of errors, we can easily point the views to an older version of the data, or directly point to these older tables in the REST API configuration.
These tasks will be part of the deploy repository and will be automated.
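The versioning scheme above can be sketched as a few Python helpers that generate the SQL involved. This is an illustration under assumptions, not the deploy repository's code: the YYYYMMDD suffix format, the table and view names, and the function names are all hypothetical. CREATE OR REPLACE VIEW and DROP TABLE IF EXISTS are standard MySQL statements.

```python
from datetime import date


def versioned_table_name(base, version):
    """Append a date suffix to the base table name (suffix format is assumed)."""
    return f"{base}_{version:%Y%m%d}"


def point_view_sql(view, base, version):
    """SQL that points the canonical view at one version of the table.

    Rolling back is the same call with an older version date.
    """
    table = versioned_table_name(base, version)
    return f"CREATE OR REPLACE VIEW `{view}` AS SELECT * FROM `{table}`"


def drop_old_versions_sql(base, versions, keep=1):
    """SQL that drops every version except the `keep` newest ones."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    stale = sorted(versions)[:-keep]
    return [f"DROP TABLE IF EXISTS `{versioned_table_name(base, v)}`" for v in stale]


versions = [date(2018, 10, 1), date(2019, 1, 15)]
switch = point_view_sql("article_recommendation", "article_recommendation_data", max(versions))
cleanup = drop_old_versions_sql("article_recommendation_data", versions, keep=1)
```

Keeping the view switch and the cleanup as separate steps matches the process above: the old tables stay in place while the new data is monitored and are only dropped once no problems are found.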
How are you going to generate data in the Analytics cluster?
We want to watch for new Wikidata dumps and generate new recommendations every quarter (tentatively). We'll adapt the implementation repository to work in the cluster and automate the process.
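As a rough sketch of the scheduling decision, the check below picks the newest available dump only if enough time has passed since the last processed one. The 90-day interval stands in for the tentative quarterly cadence, and the function name and inputs are hypothetical. How dump dates are discovered in the cluster is left out.

```python
from datetime import date, timedelta


def pick_dump_to_process(available_dumps, last_processed, min_interval=timedelta(days=90)):
    """Return the newest dump date if it is due for processing, else None.

    A dump is due when nothing has been processed yet, or when it is at
    least `min_interval` newer than the last processed dump.
    """
    if not available_dumps:
        return None
    newest = max(available_dumps)
    if last_processed is None or newest - last_processed >= min_interval:
        return newest
    return None


dumps = [date(2019, 1, 7), date(2019, 4, 8)]
due = pick_dump_to_process(dumps, last_processed=date(2019, 1, 7))
not_due = pick_dump_to_process(dumps, last_processed=date(2019, 3, 1))
```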