Grants:IdeaLab/Reuse of citations across languages

Reuse citations by alignment of prose across languages and projects
Idea creator: Jeblad
Created on: 20:06, 5 April 2016 (UTC)

Project idea[edit]

What is the problem you're trying to solve?[edit]

Citations are quite tedious to get right, but once done they can be reused across projects and languages. Sometimes it is obvious which citations can be reused, but usually it takes a lot of time to identify the correct citation among the 280+ language editions of Wikipedia.

What is your solution?[edit]

Assume that the user has placed the cursor in the target text. The text in front of the cursor is then the target text that we try to find in a source text. (The target text is where we insert citations, and the source text is where we find them.)

This is only an initial investigation of the problem, and it must focus on analysis of available methods, not code quality. I would expect the initial investigation and content analysis to take at least three person-months.

Instead of manually scanning through all source texts, it is possible to align the sentences leading up to a citation in the source against the sentence before the cursor in the target text. Intuitively such alignment should be very difficult to achieve (imagine computing edit distances over translated text), but in fact a number of much simpler methods work.

The text snippet leading up to the citation should be shown together with the actual citation, so that users can verify that the correct citation is selected. One idea is that the citation dialog in VisualEditor could get a fourth tab for reuse of external sources, much like the third tab (mw:Help:VisualEditor/User guide/Citations-Full#Re-using an existing reference) but with some small changes.

When the user picks a citation for reuse, the citation template should be translated from the version used on the source language and/or project to the version used on the target. Many projects have defined templates for such substitution, and they should be made usable somehow. If they exist and can be identified, the citation is translated through an API call, and the localized template call is then injected into the target text. An alternative solution is to identify the relations through statements at Wikidata.
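As a sketch of the Wikidata-based approach (here using the sitelinks on a template's Wikidata item rather than statements), the corresponding template name on the target wiki can be looked up through the public wbgetentities API. This is a minimal Python sketch; the template and wiki names are only examples, and a real solution would also need per-project parameter mappings:

  import requests

  def find_target_template(source_site, template_title, target_site):
      """Look up the template's Wikidata item via its sitelink on the
      source wiki, and return the matching title on the target wiki."""
      response = requests.get(
          "https://www.wikidata.org/w/api.php",
          params={
              "action": "wbgetentities",
              "sites": source_site,
              "titles": template_title,
              "props": "sitelinks",
              "format": "json",
          },
      ).json()
      for entity in response.get("entities", {}).values():
          sitelinks = entity.get("sitelinks", {})
          if target_site in sitelinks:
              return sitelinks[target_site]["title"]
      return None

  # Example: map the English citation template to its Norwegian counterpart.
  print(find_target_template("enwiki", "Template:Cite web", "nowiki"))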

Algorithms[edit]

One of the more useful correlation techniques for such realignment is in fact to count characters in the sentences. Other methods are counting words, which usually gives worse results, and using locality-sensitive hashing on character trigrams. Trigrams work especially well between closely related languages.
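As a minimal sketch of the two simplest features, here is a character-count comparison and a character-trigram overlap (plain Jaccard similarity stands in for locality-sensitive hashing, and the example sentences are my own):

  def length_similarity(target: str, source: str) -> float:
      """Compare sentences by character count alone; translated
      sentences tend to have roughly proportional lengths."""
      longer = max(len(target), len(source))
      shorter = min(len(target), len(source))
      return shorter / longer if longer else 1.0

  def trigram_similarity(target: str, source: str) -> float:
      """Jaccard overlap of character trigrams; works best between
      closely related languages sharing cognates and spelling."""
      def trigrams(text):
          text = text.lower()
          return {text[i:i + 3] for i in range(len(text) - 2)}
      a, b = trigrams(target), trigrams(source)
      return len(a & b) / len(a | b) if a | b else 1.0

  # A Norwegian target sentence against an English source sentence.
  print(length_similarity("Katten sitter på matten.", "The cat sits on the mat."))
  print(trigram_similarity("Katten sitter på matten.", "The cat sits on the mat."))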

In general we can use several features, assign some probability to each, and then sum up all of the probabilities. The idea is that we observe some feature in the target, and then try to find the same feature in sources in other languages and/or projects. If we find the feature in the text leading up to a citation in the source, we can calculate the probability that the alignment is correct for this specific text snippet given this specific feature. By summing the log probabilities for all detected features, an overall log probability for the alignment can be found. We then use this log probability as the sort order for the proposed citations from the source, and if everything works out, the correct citation floats to the top.
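In naive Bayes terms (the notation is mine, not the proposal's): let A be the event that a given source snippet aligns with the target, and f_1, ..., f_n the features observed in the target. Assuming the features are independent given A,

  \log P(A \mid f_1, \dots, f_n) \propto \log P(A) + \sum_{i=1}^{n} \log P(f_i \mid A)

and sorting the source snippets by this sum gives exactly the proposed sort order.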

Calculation of the probabilities is a bit more involved, as we must calculate probabilities for all outcomes over all features: if we don't find a feature, we must still sum its probability, otherwise multiple found features would impose a penalty. The UI is also slightly more complex, as there will not be just one list to sort, but one list per language.
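A small sketch of the scoring with absent features handled as described: each feature contributes either log P(found | aligned) or log P(not found | aligned), so that scores of snippets with different feature counts stay comparable. All probabilities and feature names below are made-up placeholders:

  import math

  # Placeholder values for P(feature found in source | snippets aligned);
  # in reality these would be estimated per language pair.
  FEATURES = {
      "similar_length": 0.8,
      "shared_trigrams": 0.7,
      "shared_numbers": 0.9,
  }

  def log_probability(found: set) -> float:
      """Sum over all features, counting absent ones too, so that
      matching many features never acts as a penalty by itself."""
      score = 0.0
      for name, p_found in FEATURES.items():
          score += math.log(p_found if name in found else 1.0 - p_found)
      return score

  # Sort candidates so the most probable alignment floats to the top.
  candidates = [
      ("snippet A", {"similar_length", "shared_numbers"}),
      ("snippet B", {"shared_trigrams"}),
  ]
  candidates.sort(key=lambda item: log_probability(item[1]), reverse=True)
  print(candidates)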

There are a lot of published articles about aligning texts in different languages, and some of the algorithms are pretty easy to implement.[1][2][3]

Project goals[edit]

The core goal of this idea is to make it a lot more efficient to insert citations in the target text when there are citations attached to similar prose in other projects or languages. By automating the process it is possible to scan through a lot more text, which should make it more attractive for writers to actually add citations. Citations pointing to online sources should be especially attractive.

To reach the core goal in full, a sub-goal would be to make a gadget that automates the reuse of citations. If that gadget is sufficiently effective, it should be possible to detect a substantial rise in the use of citations in a target language.

It is not likely that a project will reach a state where there is a well-tested UI within the given time frame. A more realistic goal is to verify that the proposed solution is doable and gives sufficiently good results.

Risks[edit]

Users in the community might not accept the solution
This is probably my biggest concern. It is often rephrased as "not invented here". As long as this is just an investigation of the concept, it is not an issue whether the final solution will be accepted by the community, but a finalized gadget must be accepted, otherwise it will be useless. One thing in favor of the proposed solution is that the manual alternative is quite slow.
The project might not be finished within time
Always a possibility, and I'm very good at overestimating my own progress. Still, a project whose goal is to investigate the concept should be doable within the given time.
The new code might pose a security risk
If a system is implemented, the API calls would pose a security risk, though note that the calls would give read-only access. A gadget picking up the results would only add them back during editing, and the page would then go through normal cleanup. One can imagine an attack on the source wiki, storing some attack vector there, which a user then finds as a text snippet for a citation, triggering the attack in the browser. A simple workaround could be to strip off all but the most basic tagging from the snippets, as in the sketch below.
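A sketch of the suggested workaround: strip everything but a small whitelist of inline tags (the whitelist here is just an example) before a snippet is shown, dropping attributes as well. A real gadget should rely on a proper sanitizer rather than this regex:

  import re

  ALLOWED_TAGS = {"b", "i", "em", "strong"}  # example whitelist
  TAG_PATTERN = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

  def strip_tags(snippet: str) -> str:
      """Remove all markup except a few harmless inline tags,
      rebuilding the kept tags without their attributes."""
      def keep_or_drop(match):
          tag = match.group(1).lower()
          if tag in ALLOWED_TAGS:
              closing = "/" if match.group(0).startswith("</") else ""
              return "<" + closing + tag + ">"
          return ""
      return TAG_PATTERN.sub(keep_or_drop, snippet)

  print(strip_tags('A <script>bad()</script> but <em>aligned</em> snippet.'))
  # -> 'A bad() but <em>aligned</em> snippet.'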
The new code might create too high a load on the system
If we choose to do the actual feature detection on the servers, we could end up with several hundred feature extractors running on each of a couple of hundred text snippets. At the very least we need compiled and cached regexes for the feature extractors, but also an efficient approach to how we loop over the snippets.

Project plan[edit]

Activities[edit]

  1. Make a literature search for working solutions
  2. Choose and describe a few core methods in addition to a Bayes estimator
  3. Create a basic structure for feature extraction based on regexes (note that we only use a few hundred features; stored as JSON perhaps, or on Wikidata? See the sketch below)
  4. Figure out how we store the probabilities (these are precalculated, perhaps saved to a JSON page?)
  5. Figure out if we can do the calculation in the browser (this scales better)
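As a sketch of activities 3 to 5, the feature extractors could be defined declaratively on a JSON page and compiled once before being applied to every snippet. The feature names and patterns below are invented examples:

  import json
  import re

  # Hypothetical feature definitions, as they might be stored on a JSON page.
  FEATURE_PAGE = json.loads(r"""
  {
      "year": "\\b(1[89]|20)\\d{2}\\b",
      "quoted_title": "\"[^\"]{5,}\"",
      "isbn": "\\bISBN:? *[0-9Xx-]{10,17}\\b"
  }
  """)

  # Compile every pattern once, so the cost is not paid per snippet.
  EXTRACTORS = {name: re.compile(p) for name, p in FEATURE_PAGE.items()}

  def extract_features(snippet: str) -> set:
      """Return the names of all features detected in one snippet."""
      return {name for name, rx in EXTRACTORS.items() if rx.search(snippet)}

  print(extract_features('See "A Study of Alignment", 1993, ISBN 0-123-45678-9.'))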

Budget[edit]

The estimated workload is about 3 person-months of full-time work for an experienced developer, or 6 calendar months at 50 %. This estimate is based on the main developer's previous experience with similar projects.

Budget breakdown[edit]

Item            Description                              Commitment        Person-months   Cost
Main developer  Developing and releasing proposed code   Part time (50 %)  6               USD 12,400
Total                                                                                      USD 12,400

There is no co-funding.

The item costs are computed as follows: the main developer's gross salary (including 35 % Norwegian income tax) is estimated from the pay given to similar projects using standard Norwegian salaries,[4] at the current exchange rate of 1 NOK = 0.120649 USD, for a quarter of a year's full-time work. (Working backwards, USD 12,400 / 0.120649 ≈ NOK 102,800 for the quarter, corresponding to an annual gross salary of roughly NOK 411,000.)

Sustainability[edit]

The results will be available on-wiki. Some of the analysis will probably have to be rerun later, which implies it must be done with reusability and maintainability in mind. The analysis must therefore be documented, and manuals, initial tutorials, and examples must be made.

Community engagement[edit]

Other than providing examples of possible feature extractors, the community in general is not expected to participate much in the investigation of the problem.

It will, however, be necessary to get feedback on the very limited UI, and to get help with translating a few system messages.

Measures of success[edit]

  1. Has unconditional probabilities for the languages
  2. Has prior probabilities for the language pairs
  3. Can calculate the necessary posterior probabilities
  4. Can calculate the log probability of a text snippet
  5. Can sort a list of snippets according to the log probabilities

Get Involved[edit]

About the idea creator[edit]

  • Jeblad – I'm a Wikipedian with a cand.sci. in mathematical modeling, and started editing Wikipedia during the summer of 2005.

Participants[edit]

This idea is up for grabs if anyone wants to try! :)

Endorsements[edit]

References[edit]

  1. Kay, Martin; Röscheisen, Martin. "Text-Translation Alignment" (PDF). 
  2. Singh, Anil Kumar; Subramaniam, Sethuramalingam; Rama, Taraka. "Transliteration as Alignment vs. Transliteration as Generation for Crosslingual Information Retrieval" (PDF). 
  3. Ganchev, Kuzman; Graça, João V.; Taskar, Ben. "Better Alignments = Better Translations?" (PDF). 
  4. Norwegian government salary tables, salary step (lønnstrinn) 47: https://www.regjeringen.no/no/dokumenter/lonnstabeller/id438643/#foerti