Learning patterns/Using Wikidata to make Machine Translation dictionary entries

From Meta, a Wikimedia project coordination wiki
A learning pattern for project management
Using Wikidata to make Machine Translation dictionary entries
problem: Expanding MT coverage with Wikipedia-relevant translations
solution: Using Wikidata to make lists of candidate translations
created on: 08:52, 21 June 2016 (UTC)

What problem does this solve?

When working on Apertium MT systems for use in Content Translation, we want as many bilingual dictionary entries as possible that are directly relevant for translating Wikipedia articles. Wikidata's interwiki labels give us some of that, although the data needs some parsing and manual checking. We can ease the checking by sorting the candidates by string difference.

What is the solution?


You'll need a Unix machine (Linux/Mac) with the regular command line tools (bzcat, grep, sed, gawk), and the JSON command-line processor jq.

NB: If all that sounds a bit of a hassle and/or you're not very familiar with the command line, but you still want to help with checking translations for your favourite translation direction, you can get in touch with someone who can help prepare files to check.

Parsing/filtering data

Extracting entity labels of a certain category

The page https://www.wikidata.org/wiki/Wikidata:Database_download lists Wikidata dumps; downloads are at https://dumps.wikimedia.org/wikidatawiki/entities/

After downloading, you can run e.g.

 bzcat wikidata-20160229-all.json.bz2 \
  | grep '^{' |sed 's/,$//' \
  | jq -c '{ "sv":.labels.sv.value, "nn":.labels.nn.value }' 

to get some translation labels out. The grep|sed makes each line a simple JSON object, which lets jq easily process it without reading the whole dump into memory. But we don't want every single entity – that would be a lot of work to manually check, and by just looking at labels we can't even be sure what they're referring to (which would make translation-checking difficult). It's much more efficient to take a single category at a time.
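To see what the grep|sed step does, here is a minimal sketch on a made-up sample shaped like the dump: an opening bracket, one entity per line with a trailing comma, and a closing bracket. The function names and the two sample entities are illustrative assumptions, not part of the pipeline above.

```shell
# Hypothetical two-entity sample shaped like the Wikidata JSON dump.
sample_dump() {
  printf '%s\n' \
    '[' \
    '{"id":"Q1","labels":{"sv":{"value":"A"}}},' \
    '{"id":"Q2","labels":{"sv":{"value":"B"}}}' \
    ']'
}

# Same filter as in the pipeline above: keep only the entity lines
# and strip the trailing comma, so each line is a standalone JSON object.
one_entity_per_line() {
  grep '^{' | sed 's/,$//'
}

sample_dump | one_entity_per_line
```

The bracket lines are dropped and each remaining line can be handed to jq on its own, which is why the whole dump never has to be held in memory.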

Say we want to get some place-names/toponyms; a simple way to get only toponyms is to check that the entry "claims" a coordinate location, i.e.

bzcat wikidata-20160229-all.json.bz2 \
  | grep '^{' |sed 's/,$//' \
  | jq -r 'select(.claims.P625 and .labels.sv and .labels.nn)  # only include if both labels and the right property
           | [ .labels.sv.value, .labels.nn.value ]            # if included, output only the labels
           | @tsv                                              # and turn into tab-separated values
           ' > P625-sv-nn.tsv

This ensures that the entity itself claims a coordinate location. Entities that only reach a coordinate location indirectly, through some other property, are skipped, which is good: we don't want to include Jorge Luis Borges just because his burial place has a coordinate location.

Getting only labels that are actually used

The file P625-sv-nn.tsv is now checkable, but we may want to further filter the list so that it only includes entries that actually appear in Swedish Wikipedia. This little gawk script runs through a Wikipedia dump, counts how often each Swedish label occurs as a sequence of one to five words, and keeps only the entries seen more than a couple of times:

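bzcat svwiki-20160111-pages-articles.xml.bz2 \
  | tr '[:space:][:punct:]' '\n'             \
  | gawk -v dict=P625-sv-nn.tsv '
BEGIN {
  OFS = FS = "\t"
  # Load the label pairs as d[source][target]:
  while ((getline < dict) > 0) d[$1][$2]++
}
{
  # Make uni, bi, tri, quad, pentagrams:
  bi=p1" "$0;tri=p2" "bi;quad=p3" "tri;pent=p4" "quad;
  # and shift:
  p4=p3;p3=p2;p2=p1;p1=$0;
  # Count every n-gram ending at this token that is a known source label:
  h[$0]++;h[bi]++;h[tri]++;h[quad]++;h[pent]++
  for(src in h)if(src in d)seen[src]++
  for(src in h){delete h[src]}
}
END {
  # Print the most frequent labels first:
  PROCINFO["sorted_in"] = "@val_num_desc"
  for(src in seen)for(trg in d[src]) {
    # Keep only those that have been seen more than, say, 2 times:
    if(seen[src]>2)print src,trg,seen[src]}
}
' > freq-P625-sv-nn.tsv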
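The frequency filter relies on a sliding window of word n-grams, so that multi-word labels can be matched against a stream of single tokens. As a rough sketch of just that windowing idea, the following uses plain awk, stops at trigrams, and runs on three made-up tokens; the function name ngrams is an assumption for illustration.

```shell
# Print every uni-, bi- and trigram that ends at each input token.
# Expects one non-empty token per line on standard input.
ngrams() {
  awk '{
    bi  = p1 " " $0          # previous token + current token
    tri = p2 " " bi          # token before that + the bigram
    print $0
    if (NR >= 2) print bi
    if (NR >= 3) print tri
    p2 = p1; p1 = $0         # shift the window by one token
  }'
}

printf '%s\n' Nya Zeeland ligger | ngrams
```

Among the printed n-grams is "Nya Zeeland", which is how a two-word label can be looked up even though tr has put every word on its own line.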

Sorting by dissimilarity

Finally, we can sort the list by edit distance (string dissimilarity), to ease the manual checking, since more similar place names are less likely to be bad translations:

# If you didn't run the frequency filter step above, 
# then you can just use the first P625-sv-nn.tsv file here:
<freq-P625-sv-nn.tsv gawk '
# Labels may contain spaces, so split and join fields on tabs only:
BEGIN { FS = OFS = "\t" }

# A helper function from http://awk.freeshell.org/LevenshteinEditDistance
# You do not have to understand this :)
function levdist(str1, str2,    l1, l2, tog, arr, i, j, a, b, c) {
        if (str1 == str2) {
                return 0
        } else if (str1 == "" || str2 == "") {
                return length(str1 str2)
        } else if (substr(str1, 1, 1) == substr(str2, 1, 1)) {
                a = 2
                while (substr(str1, a, 1) == substr(str2, a, 1)) a++
                return levdist(substr(str1, a), substr(str2, a))
        } else if (substr(str1, l1=length(str1), 1) == substr(str2, l2=length(str2), 1)) {
                b = 1
                while (substr(str1, l1-b, 1) == substr(str2, l2-b, 1)) b++
                return levdist(substr(str1, 1, l1-b), substr(str2, 1, l2-b))
        }
        for (i = 0; i <= l2; i++) arr[0, i] = i
        for (i = 1; i <= l1; i++) {
                arr[tog = ! tog, 0] = i
                for (j = 1; j <= l2; j++) {
                        a = arr[! tog, j  ] + 1
                        b = arr[  tog, j-1] + 1
                        c = arr[! tog, j-1] + (substr(str1, i, 1) != substr(str2, j, 1))
                        arr[tog, j] = (((a<=b)&&(a<=c)) ? a : ((b<=a)&&(b<=c)) ? b : c)
                }
        }
        return arr[tog, j-1]
}

# Prefix each line with the edit distance between the two labels:
{ print levdist($1,$2),$0}
' |sort -nr > levsorted-P625-sv-nn.tsv

The file levsorted-P625-sv-nn.tsv is now ready for manual checking.

Manual checking

If you've done the above, you can then go through your levsorted-P625-sv-nn.tsv, and just remove (or fix) bad lines. Lines that are very similar and thus most likely OK should be at the bottom of the file, the most likely errors (but also good translations that happen to be very different) at the top.

When you're done, you can send it to an Apertium committer – you can find some ways to get in touch with committers at http://wiki.apertium.org/wiki/Contact or just put the file on a sub-page of your Wikipedia user page and send a message to people like User:Unhammer or User:Francis_Tyers.

Things to consider

Lists from Wikidata do still need manual checking:

  • names might have errors, e.g. the Swedish Wikipedia has a lot of bot-generated articles with names that are in English instead of Swedish (e.g. https://sv.wikipedia.org/wiki/Mushketov_Glacier which should probably be "Mušketov glaciär" or similar).
  • sometimes names in one language are over-specified compared to the other, e.g. Swedish uses just "Friesland" where Nynorsk uses "Friesland i Nederland" – if we add this to the dictionary, and the Swedish input is "Friesland i Nederland" then the Nynorsk translation would become "Friesland i Nederland i Nederland".
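Over-specified pairs like the Friesland example can be flagged automatically before the manual pass. The following is only a sketch under the assumption that the input is a tab-separated file with the source label in the first column and the target label in the second; the function name flag_overspecified and the CHECK marker are made up for illustration. It prints every pair where one label is fully contained in the other without being identical.

```shell
# Flag pairs where one label contains the other but they are not equal.
# Input: tab-separated lines, source label first, target label second.
flag_overspecified() {
  awk -F '\t' -v OFS='\t' '
    $1 != "" && $2 != "" && $1 != $2 &&
    (index($2, $1) > 0 || index($1, $2) > 0) { print "CHECK", $0 }
  '
}

printf 'Friesland\tFriesland i Nederland\nOslo\tOslo\n' | flag_overspecified
```

A flagged line is not necessarily wrong; it only marks entries worth a closer look, and substring containment will also catch harmless cases where one name simply happens to occur inside another.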

When to use

In general, when the machine translator makes errors due to unknown words or words that are missing from the dictionaries.

When the machine translator makes a lot of errors on proper nouns / place names, the above would be perfect to expand its coverage. You can also filter on other properties of course, but then the manual checking becomes more difficult.

This pattern was created as part of the IEG project "Pan-Scandinavian Machine-assisted Content Translation".[1]

